Uniform Central Limit Theorems
Second Edition 


This work about probability limit theorems for empirical processes on general spaces,
by one of the founders of the field, has been considerably expanded and revised from the
original edition. When samples become large, laws of large numbers and central limit
theorems are guaranteed to hold uniformly over wide domains. The author gives a thorough
treatment of the subject, including an extended treatment of Vapnik–Chervonenkis
combinatorics, the Ossiander $L^2$ bracketing central limit theorem, the Giné–Zinn
bootstrap central limit theorem in probability, the Bronstein theorem on approximation of
convex sets, and the Shor theorem on rates of convergence over lower layers. This
new edition contains proofs of several main theorems not proved in the first edition,
including the Bretagnolle–Massart theorem giving constants in the Komlós–Major–Tusnády
rate of convergence for the classical empirical process, Massart’s form of
the Dvoretzky–Kiefer–Wolfowitz inequality with precise constant, Talagrand’s generic
chaining approach to boundedness of Gaussian processes, a characterization of uniform
Glivenko–Cantelli classes of functions, Giné and Zinn’s characterization of uniform
Donsker classes of functions (i.e., classes for which the central limit theorem
holds uniformly, also uniformly over all probability measures P), and the
Bousquet–Koltchinskii–Panchenko theorem that the convex hull of a uniform Donsker class is
uniform Donsker.

The book will be an essential reference for mathematicians working in infinite- 
dimensional central limit theorems, mathematical statisticians, and computer scientists 
working in computer learning theory. Problems are included at the end of each chapter 
so the book can also be used as an advanced text. 


R. M. DUDLEY is a Professor of Mathematics at the Massachusetts Institute of Technology 
in Cambridge, Massachusetts. 
Preface to the Second Edition


This book developed out of some topics courses given at M.I.T. and my lectures 
at the St.-Flour probability summer school in 1982. The material of the book 
has been expanded and extended considerably since then. At the end of each 
chapter are some problems and notes on that chapter. 

Starred sections are not cited later in the book except perhaps in other 
starred sections. The first edition had several double-starred sections in which 
facts were stated without proofs. This edition has no such sections. 

The following, not proved in the first edition, now are: (i) for Donsker’s theorem
on the classical empirical process $\alpha_n := \sqrt{n}(F_n - F)$, and the
Komlós–Major–Tusnády strengthening to give a rate of convergence, the
Bretagnolle–Massart proof with specified constants; (ii) Massart’s form of the
Dvoretzky–Kiefer–Wolfowitz inequality for $\alpha_n$ with optimal constant; (iii) Talagrand’s
generic chaining approach to boundedness of Gaussian processes, which
replaces the previous treatment of majorizing measures; (iv) characterization of
uniform Glivenko–Cantelli classes of functions (from a paper by Dudley, Giné,
and Zinn, but here with a self-contained proof); (v) Giné and Zinn’s characterization
of uniform Donsker classes of functions; (vi) its consequence that uniformly
bounded, suitably measurable classes of functions satisfying Pollard’s
entropy condition are uniformly Donsker; and (vii) Bousquet, Koltchinskii, and
Panchenko’s theorem that convex hull preserves the uniform Donsker property.

The first edition contained a chapter on invariance principles, based on a 
1983 paper with the late Walter Philipp. Some techniques introduced in that 
paper, such as measurable cover functions, are still used in this book. But, 
I have not worked on invariance principles as such since 1983. Much of the 
work on them treats dependent random variables, as did parts of the 1983 paper 
which Philipp contributed. The present book is mainly about the i.i.d. case. So 
I suppose the chapter is outdated, and I omit it from this edition. 




For useful conversations and suggestions on topics in the book I’m glad 
to thank Kenneth Alexander, Niels Trolle Andersen, Miguel Arcones, Patrice 
Assouad, Erich Berger, Lucien Birgé, Igor S. Borisov, Donald Cohn, Yves 
Derrienic, Uwe Einmahl, Joseph Fu, Sam Gutmann, David Haussler, Jørgen 
Hoffmann-Jørgensen, Yen-Chin Huang, Vladimir Koltchinskii, the late Lucien 
Le Cam, David Mason, Pascal Massart, James Munkres, Rimas Norvaiša, the 
late Walter Philipp, Tom Salisbury, the late Rae Shortt, Michel Talagrand, Jon 
Wellner, He Sheng Wu, Joe Yukich, and Joel Zinn. I especially thank Denis 
Chetverikov, Peter Gaenssler and Franz Strobl, Evarist Giné, and Jinghua Qian, 
for providing multiple corrections and suggestions. I also thank Xavier Fernique 
(for the first edition), Evarist Giné (for both editions), and Xia Hua (for the 
second edition) for giving or sending me copies of expositions. 


Note 


Throughout this book, all references to “RAP” are to the author’s book Real 
Analysis and Probability, second edition, Cambridge University Press, 2002. 

Also, “A := B” means A is defined by B, whereas “A =: B” means B is 
defined by A. 
1


Donsker’s Theorem, Metric Entropy, 
and Inequalities 


Let $P$ be a probability measure on the Borel sets of the real line $\mathbb{R}$ with
distribution function $F(x) := P((-\infty, x])$. Let $X_1, X_2, \ldots$ be i.i.d. (independent,
identically distributed) random variables with distribution $P$. For each
$n = 1, 2, \ldots$ and any Borel set $A \subset \mathbb{R}$, let
$P_n(A) := \frac{1}{n}\sum_{j=1}^{n} \delta_{X_j}(A)$, where
$\delta_x(A) = 1_A(x)$. For any given $X_1, \ldots, X_n$, $P_n$ is a probability measure called
the empirical measure. Let $F_n$ be the distribution function of $P_n$. Then $F_n$ is
called the empirical distribution function.
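
As a small illustration of these definitions, here is a minimal Python sketch that builds the empirical distribution function from a sample. The function name `empirical_cdf` and the sample values are illustrative choices, not notation from the text.

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return F_n, the distribution function of the empirical measure P_n."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(x):
        # P_n((-inf, x]) = (number of X_j <= x) / n
        return bisect_right(xs, x) / n

    return F_n

# Hypothetical sample with a repeated value, so F_n has a jump of size 2/n at 0.3.
sample = [0.3, 1.2, 0.3, 2.5]
F_n = empirical_cdf(sample)
values = [F_n(x) for x in (0.0, 0.3, 1.0, 1.2, 3.0)]
print(values)
```

Sorting once and using binary search keeps each evaluation of $F_n$ at logarithmic cost in $n$.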

Let $U$ be the $U[0, 1]$ distribution function $U(x) = \min(1, \max(0, x))$ for
all $x$, so that $U(x) = x$ for $0 \le x \le 1$, $U(x) = 0$ for $x < 0$ and $U(x) = 1$ for
$x > 1$. To relate $F$ and $U$ we have the following.


Proposition 1.1 For any distribution function $F$ on $\mathbb{R}$:

(a) For any $y$ with $0 < y < 1$, $F^{\leftarrow}(y) := \inf\{x : F(x) \ge y\}$ is well-defined
and finite.

(b) For any real $x$ and any $y$ with $0 < y < 1$ we have $F(x) \ge y$ if and only if
$x \ge F^{\leftarrow}(y)$.

(c) If $V$ is a random variable having $U[0, 1]$ distribution, then $F^{\leftarrow}(V)$ has
distribution function $F$.


Proof. For (a), recall that $F$ is nondecreasing, $F(x) \to 0$ as $x \to -\infty$, and
$F(x) \to 1$ as $x \to +\infty$. So the set $\{x : F(x) \ge y\}$ is nonempty and bounded
below, and has a finite infimum.

For (b), $F(x) \ge y$ implies $x \ge F^{\leftarrow}(y)$ by definition of $F^{\leftarrow}(y)$. Conversely,
as $F$ is continuous from the right, $F(F^{\leftarrow}(y)) \ge y$, and as $F$ is nondecreasing,
$x \ge F^{\leftarrow}(y)$ implies $F(x) \ge y$.

For (c), and any $x$, we have by (b)

$$\Pr(F^{\leftarrow}(V) \le x) = \Pr(V \le F(x)) = U(F(x)) = F(x)$$

since $0 \le F(x) \le 1$, so (c) holds.
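
The equivalence in part (b) can be checked mechanically for a distribution with atoms, where $F$ is not invertible in the usual sense. The following Python sketch uses exact rational arithmetic; the three-point law (`atoms`, `probs`) is a hypothetical example chosen only for illustration.

```python
from bisect import bisect_left
from fractions import Fraction

# Hypothetical discrete law: atoms and their probabilities (illustration only).
atoms = [-1, 0, 2]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
cum = [sum(probs[: i + 1]) for i in range(len(probs))]  # values of F at the atoms

def F(x):
    """Distribution function F(x) = P((-inf, x])."""
    return sum((p for a, p in zip(atoms, probs) if a <= x), Fraction(0))

def F_inv(y):
    """Generalized inverse inf{x : F(x) >= y}, for 0 < y < 1."""
    # The infimum is the first atom whose cumulative probability reaches y.
    return atoms[bisect_left(cum, y)]

# Part (b): F(x) >= y  if and only if  x >= F_inv(y), checked on a grid.
ys = [Fraction(k, 20) for k in range(1, 20)]
xs = [Fraction(k, 4) for k in range(-8, 13)]
part_b_holds = all((F(x) >= y) == (x >= F_inv(y)) for x in xs for y in ys)
print(part_b_holds)
```

Part (c) is the basis of inverse transform sampling: applying `F_inv` to uniform random numbers yields draws from the law with distribution function $F$.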


Recall that for any function $f$ defined on the range of a function $g$, the
composition $f \circ g$ is defined by $(f \circ g)(x) := f(g(x))$. We can then relate
empirical distribution functions $F_n$ for any distribution function $F$ to those $U_n$
for $U$, as follows.


Proposition 1.2 For any distribution function $F$, and empirical distribution
functions $F_n$ for $F$ and $U_n$ for $U$, $U_n \circ F$ have all the properties of $F_n$.


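Proof. Let $V_1, \ldots, V_n$ be i.i.d. $U$, so that $U_n(t) = \frac{1}{n}\sum_{j=1}^{n} 1_{V_j \le t}$ for $0 \le t \le 1$.
Thus for any $x$, by Proposition 1.1(b) and (c),

$$U_n(F(x)) = \frac{1}{n}\sum_{j=1}^{n} 1_{V_j \le F(x)}
= \frac{1}{n}\sum_{j=1}^{n} 1_{F^{\leftarrow}(V_j) \le x}
= \frac{1}{n}\sum_{j=1}^{n} 1_{X_j \le x}$$

where $X_j := F^{\leftarrow}(V_j)$ are i.i.d. ($F$). Thus $U_n(F(x))$ has all properties of
$F_n(x)$.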
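
The identity in this proof can be observed numerically. The sketch below takes $F$ to be the exponential distribution function with rate 1, an illustrative choice not made in the text, builds $X_j := F^{\leftarrow}(V_j)$ from uniform $V_j$, and compares $U_n(F(x))$ with $F_n(x)$ on a grid. Apart from floating-point ties the two should coincide.

```python
import math
import random

def F(x):
    """Exponential(1) distribution function (an illustrative choice of F)."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def F_inv(v):
    """Its generalized inverse, for 0 <= v < 1."""
    return -math.log(1.0 - v)

rng = random.Random(0)
n = 200
V = [rng.random() for _ in range(n)]   # i.i.d. U[0, 1]
X = [F_inv(v) for v in V]              # i.i.d. with distribution function F

def U_n(t):
    """Empirical distribution function of the V_j."""
    return sum(v <= t for v in V) / n

def F_n(x):
    """Empirical distribution function of the X_j."""
    return sum(xj <= x for xj in X) / n

grid = [0.1 * k for k in range(60)]
max_gap = max(abs(U_n(F(x)) - F_n(x)) for x in grid)
print(max_gap)
```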


The developments to be described in this book began (in 1933) with the
Glivenko–Cantelli theorem, a uniform law of large numbers. Probability
distribution functions can converge pointwise but not uniformly: for example, as
$n \to \infty$, $1_{[-1/n, +\infty)}(x) \to 1_{[0, +\infty)}(x)$ for all $x$ but not uniformly.


Theorem 1.3 (Glivenko–Cantelli) For any distribution function $F$, almost
surely, $\sup_x |(F_n - F)(x)| \to 0$ as $n \to \infty$.


Proof. By Proposition 1.2, and since $U \circ F \equiv F$, it suffices to prove this
for the $U[0, 1]$ distribution $U$. Given $\varepsilon > 0$, take a positive integer $k$ such
that $1/k < \varepsilon/2$. For each $j = 0, 1, \ldots, k$, $U_n(j/k) \to j/k$ as $n \to \infty$ with
probability 1 by the ordinary strong law of large numbers. Take $n_0 = n_0(\omega)$
such that for all $n \ge n_0$ and all $j = 0, 1, \ldots, k$, $|U_n(j/k) - j/k| < \varepsilon/2$. For $t$
outside $[0, 1]$ we have $U_n(t) \equiv U(t) = 0$ or 1. For each $t \in [0, 1]$ there is at
least one $j = 1, \ldots, k$ such that $(j - 1)/k \le t \le j/k$. Then for $n \ge n_0$,

$$(j - 1)/k - \varepsilon/2 < U_n((j - 1)/k) \le U_n(t) \le U_n(j/k) < j/k + \varepsilon/2.$$

It follows that $|U_n(t) - t| < \varepsilon$, and since $t$ was arbitrary, the theorem
follows.
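
A quick simulation shows the uniform convergence at work for the $U[0, 1]$ case. This is a minimal sketch: the helper name `sup_distance_uniform`, the seed, and the sample sizes are arbitrary choices, and the printed values depend on the random number generator.

```python
import random

def sup_distance_uniform(sample):
    """sup_t |U_n(t) - t| for a sample from U[0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    # U_n is a step function, so the supremum is approached at the order
    # statistics: compare x_(i) with U_n just before and at its jump.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(1)
results = {}
for n in (100, 10_000, 100_000):
    results[n] = sup_distance_uniform([rng.random() for _ in range(n)])
    print(n, results[n])
```

The observed suprema are typically of order $n^{-1/2}$, which anticipates the normalization by $n^{1/2}$ used for the empirical process.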
The next step was to consider the limiting behavior of $\alpha_n := n^{1/2}(F_n - F)$
as $n \to \infty$. For any fixed $t$, the central limit theorem in its most classical
form, for binomial distributions, says that $\alpha_n(t)$ converges in distribution to
$N(0, F(t)(1 - F(t)))$, in other words a normal (Gaussian) law, with mean 0
and variance $F(t)(1 - F(t))$.

In what follows (as mentioned in the Note after the Preface), “RAP” will 
mean the author’s book Real Analysis and Probability. 

For any finite set $T$ of values of $t$, the multidimensional central limit theorem
(RAP, Theorem 9.5.6) tells us that $\alpha_n(t)$ for $t$ in $T$ converges in distribution
as $n \to \infty$ to a normal law $N(0, C_F)$ with mean 0 and covariance
$C_F(s, t) = F(s)(1 - F(t))$ for $s \le t$.

The Brownian bridge (RAP, Section 12.1) is a stochastic process $y_t(\omega)$
defined for $0 \le t \le 1$ and $\omega$ in some probability space $\Omega$, such that for any
finite set $S \subset [0, 1]$, $y_t$ for $t$ in $S$ have distribution $N(0, C)$, where $C = C_U$
for the uniform distribution function $U(t) = t$, $0 \le t \le 1$. So the empirical
process $\alpha_n$ converges in distribution to the Brownian bridge composed with $F$,
namely $t \mapsto y_{F(t)}$, at least when restricted to finite sets.

It was then natural to ask whether this convergence extends to infinite sets
or the whole interval or line. Kolmogorov (1933b) showed that when $F$ is
continuous, the supremum $\sup_t \alpha_n(t)$ and the supremum of absolute value,
$\sup_t |\alpha_n(t)|$, converge in distribution to the laws of the same functionals of $y_F$.
Then, these functionals of $y_F$ have the same distributions as for the Brownian
bridge itself, since $F$ takes $\mathbb{R}$ onto an interval including $(0, 1)$ and which
may or may not contain 0 or 1; this makes no difference to the suprema
since $y_0 \equiv y_1 \equiv 0$. Also, $y_t \to 0$ almost surely as $t \downarrow 0$ or $t \uparrow 1$ by sample
continuity; the suprema can be restricted to a countable dense set such as the
rational numbers in $(0, 1)$ and are thus measurable.

To work with the Brownian bridge process it will help to relate it to the
well-known Brownian motion process $x_t$, defined for $t \ge 0$, also called the
Wiener process. This process is such that for any finite set $T \subset [0, +\infty)$,
the joint distribution of $\{x_t\}_{t \in T}$ is $N(0, C)$ where $C(s, t) = \min(s, t)$. This
process has independent increments, namely, for any $0 = t_0 < t_1 < \cdots < t_k$,
the increments $x_{t_j} - x_{t_{j-1}}$ for $j = 1, \ldots, k$ are jointly independent, with $x_t - x_s$
having distribution $N(0, t - s)$ for $0 \le s < t$. Recall that for jointly Gaussian
(normal) random variables, joint independence, pairwise independence, and
having covariances equal to 0 are equivalent. Having independent increments
with the given distributions clearly implies that $E(x_s x_t) = \min(s, t)$ and so is
equivalent to the definition of Brownian motion with that covariance.

Brownian motion can be taken to be sample continuous, i.e. such that
$t \mapsto x_t(\omega)$ is continuous in $t$ for all (or almost all) $\omega$. This theorem, proved by Norbert
Wiener in the 1920s, is Theorem 12.1.5 in RAP; a proof will be indicated here.


If $Z$ has $N(0, 1)$ distribution, then for any $c > 0$, $\Pr(Z \ge c) \le \exp(-c^2/2)$
(RAP, Lemma 12.1.6(b)). Thus if $X$ has $N(0, \sigma^2)$ distribution for some $\sigma > 0$,
then $\Pr(X \ge c) = \Pr(X/\sigma \ge c/\sigma) \le \exp(-c^2/(2\sigma^2))$. It follows that for any
$n = 1, 2, \ldots$ and any $j = 1, 2, \ldots$,

$$\Pr\Bigl(\bigl|x_{j/2^n} - x_{(j-1)/2^n}\bigr| \ge \frac{1}{n^2}\Bigr) \le 2\exp\bigl(-2^n/(2n^4)\bigr).$$

It follows that for any integer $K > 0$, the probability of any of the above
events occurring for $j = 1, \ldots, 2^n K$ is at most $2^{n+1} K \exp(-2^n/(2n^4))$, which
approaches 0 very fast as $n \to \infty$, because of the dominant factor $-2^n$ in the
exponent. Also, the series $\sum_n 1/n^2$ converges. It follows by the Borel–Cantelli
Lemma (RAP, Theorem 8.3.4) that with probability 1, for all $t \in [0, K]$, for
a sequence of dyadic rationals $t_n \to t$ given by the binary expansion of $t$, $x_{t_n}$
will converge to some limit $X_t$, which equals $x_t$ almost surely. Specifically,
for $t < K$, let $t_n = (j - 1)/2^n$ for the unique $j \le 2^n K$ such that $(j - 1)/2^n \le
t < j/2^n$. Then $t_{n+1} = t_n = (2j - 2)/2^{n+1}$ or $t_{n+1} = (2j - 1)/2^{n+1}$, so that $t_{n+1}$ and
$t_n$ are either equal or are adjacent dyadic rationals with denominator $2^{n+1}$, and
the above bounds apply to the differences $x_{t_{n+1}} - x_{t_n}$.

The process $X_t$ is sample-continuous and is itself a Brownian motion, as
desired. From here on, a “Brownian motion” will always mean a sample-continuous
one.

Here is a reflection principle for Brownian motion (RAP, 12.3.1). A proof 
will be sketched. 


Theorem 1.4 Let $\{x_t\}_{t \ge 0}$ be a Brownian motion, $b > 0$ and $c > 0$. Then
$\Pr(\sup\{x_t : t \le b\} \ge c) = 2\Pr(x_b \ge c) = 2N(0, b)([c, +\infty))$.


Sketch of proof: If $\sup\{x_t : t \le b\} \ge c$, then by sample continuity there is a least
time $\tau$ with $0 < \tau \le b$ such that $x_\tau = c$. The probability that $\tau = b$ is 0, so we
can assume that $\tau < b$ if it exists. Starting at time $\tau$, $x_b$ is equally likely to be $> c$
or $< c$. [This holds by an extension of the independent increment property or the
strong Markov property (RAP, Section 12.2); or via approximation by suitably
normalized simple symmetric random walks and the reflection principle for
them.] Thus

$$\Pr(x_b \ge c) = \frac{1}{2}\Pr(\sup\{x_t : t \le b\} \ge c),$$

which gives the conclusion.
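
The random-walk version of the reflection principle mentioned in the sketch can be verified exactly. For a simple symmetric random walk $S_k$ and an integer level $c \ge 1$, the standard identity is $\Pr(\max_{k \le n} S_k \ge c) = 2\Pr(S_n > c) + \Pr(S_n = c)$; for Brownian motion the last term vanishes, leaving the factor 2 of Theorem 1.4. The Python sketch below, with illustrative function names, checks the identity by exact path counting.

```python
from fractions import Fraction
from math import comb

def prob_max_at_least(n, c):
    """Exact Pr(max_{k<=n} S_k >= c) for a simple symmetric random walk, c >= 1."""
    # counts[s] = number of paths so far ending at s that have not yet reached c
    counts = {0: 1}
    hit = 0  # number of length-n paths that reach level c at some time
    for step in range(n):
        new = {}
        for s, m in counts.items():
            for d in (-1, 1):
                if s + d >= c:
                    # once c is reached, the remaining steps are unrestricted
                    hit += m * 2 ** (n - step - 1)
                else:
                    new[s + d] = new.get(s + d, 0) + m
        counts = new
    return Fraction(hit, 2 ** n)

def prob_end(n, pred):
    """Exact Pr(pred(S_n)), using S_n = 2 * (number of up-steps) - n."""
    return Fraction(sum(comb(n, h) for h in range(n + 1) if pred(2 * h - n)), 2 ** n)

n, c = 20, 4
lhs = prob_max_at_least(n, c)
rhs = 2 * prob_end(n, lambda s: s > c) + prob_end(n, lambda s: s == c)
print(lhs, rhs, lhs == rhs)
```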


One way to write the Brownian bridge process $y_t$ in terms of Brownian
motion is $y_t = x_t - t x_1$, $0 \le t \le 1$. It is easily checked that this is a Gaussian
process ($y_t$ for $t$ in any finite subset of $[0, 1]$ have a normal joint distribution,
with zero means) and that the covariance $E y_s y_t = s(1 - t)$ for $0 \le s \le t \le 1$,
fitting the definition of Brownian bridge. It follows that the Brownian bridge
process, on $[0, 1]$, is also sample continuous, i.e., we can and will take it such
that $t \mapsto y_t(\omega)$ is continuous for almost all $\omega$.
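
The covariance computation for $y_t = x_t - t x_1$ can be checked exactly on a finite grid, since $y$ is a linear image of $x$. The following sketch, whose grid of eighths is an arbitrary choice, forms the matrix $A$ of that linear map and verifies $A C A^{T} = (s(1 - t))_{s \le t}$ with $C(s, t) = \min(s, t)$, using rational arithmetic.

```python
from fractions import Fraction

ts = [Fraction(k, 8) for k in range(1, 9)]   # grid 1/8, ..., 1; the last point is t = 1
m = len(ts)

# Covariance of Brownian motion on the grid: C(s, t) = min(s, t).
C = [[min(s, t) for t in ts] for s in ts]

# Matrix of the linear map x -> y with y_t = x_t - t * x_1 (x_1 is the last coordinate).
A = [[(1 if i == j else 0) - (ts[i] if j == m - 1 else 0) for j in range(m)]
     for i in range(m)]

def matmul(P, Q):
    """Plain matrix product of two lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

A_T = [list(col) for col in zip(*A)]
cov_y = matmul(matmul(A, C), A_T)

# Brownian bridge covariance: s(1 - t) for s <= t.
expected = [[min(s, t) * (1 - max(s, t)) for t in ts] for s in ts]
print(cov_y == expected)
```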

Another relation is that $y_t$ is $x_t$ for $0 \le t \le 1$ conditioned on $x_1 = 0$ in a
suitable sense, namely, it has the limit of the distributions of $\{x_t\}_{0 \le t \le 1}$ given
$|x_1| < \varepsilon$ as $\varepsilon \downarrow 0$ (RAP, Proposition 12.3.2). A proof of this will also be sketched
here. Suppose we are given a Brownian bridge $\{y_t\}_{0 \le t \le 1}$. Let $Z$ be a $N(0, 1)$
random variable independent of the $y_t$ process. Define $\xi_t = y_t + tZ$ for $0 \le t \le 1$.
Then $\xi_t$ is a Gaussian stochastic process with mean 0 and covariance given,
for $0 \le s \le t \le 1$, by $E\xi_s\xi_t = s(1 - t) + 0 + 0 + st = s$, so $\xi_t$ for $0 \le t \le 1$
has the distribution of Brownian motion restricted to $[0, 1]$. The conditional
distribution of $\xi_t$ given $|\xi_1| < \varepsilon$, in other words $|Z| < \varepsilon$, is that of $y_t + tZ$ given
$|Z| < \varepsilon$, and since $Z$ is independent of $\{y_t\}_{0 \le t \le 1}$, this conditional distribution
clearly converges to that of $\{y_t\}$ as $\varepsilon \downarrow 0$, as claimed.

Kolmogorov evaluated the distributions of $\sup_t y_t$ and $\sup_t |y_t|$ explicitly.
For the first (1-sided) supremum this follows from a reflection principle (RAP,
Proposition 12.3.3) for $y$, whose proof will be sketched:


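Theorem 1.5 For a Brownian bridge $\{y_t\}_{0 \le t \le 1}$ and any $c > 0$,

$$\Pr\Bigl(\sup_{0 \le t \le 1} y_t > c\Bigr) = \exp(-2c^2).$$


Sketch of proof: The probability is, for a Brownian motion $x_t$,

$$\lim_{\varepsilon \downarrow 0} \Pr\Bigl(\sup_{0 \le t \le 1} x_t > c \,\Big|\, |x_1| < \varepsilon\Bigr)
= \lim_{\varepsilon \downarrow 0} \Pr\Bigl(\sup_{0 \le t \le 1} x_t > c \text{ and } |x_1| < \varepsilon\Bigr) \Big/ \Pr(|x_1| < \varepsilon)$$

$$= \lim_{\varepsilon \downarrow 0} \Pr\Bigl(\sup_{0 \le t \le 1} x_t > c \text{ and } |x_1 - 2c| < \varepsilon\Bigr) \Big/ \Pr(|x_1| < \varepsilon)$$

where the last equality is by reflection. For $\varepsilon$ small enough, $\varepsilon < c$, and then
the last quotient becomes simply $\Pr(|x_1 - 2c| < \varepsilon)/\Pr(|x_1| < \varepsilon)$. Letting $\phi$
be the standard normal density function, the quotient is asymptotic as $\varepsilon \downarrow 0$ to
$\phi(2c) \cdot 2\varepsilon/(\phi(0) \cdot 2\varepsilon) = \exp(-2c^2)$ as stated.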


The distribution of $\sup_{0 \le t \le 1} |y_t|$ is given by a series (RAP, Proposition
12.3.4) as follows:


Theorem 1.6 For any $c > 0$, and a Brownian bridge $y_t$,

$$\Pr\Bigl(\sup_{0 \le t \le 1} |y_t| > c\Bigr) = 2\sum_{j=1}^{\infty} (-1)^{j-1} \exp(-2j^2c^2).$$

The proof is by iterated reflections: for example, a Brownian path which before
time 1 reaches $+c$, then later $-c$, then returns to (near) 0 at time 1, corresponds
to a path which reaches $c$, then $3c$, then (near) $4c$, and so on.
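
Both tail formulas are easy to evaluate numerically. The sketch below computes the one-sided tail $\exp(-2c^2)$ and a truncation of the alternating series; the function names and the truncation at 100 terms are illustrative choices (the terms decay so fast that far fewer would do for moderate $c$).

```python
import math

def bridge_sup_tail(c):
    """Pr(sup y_t > c) = exp(-2 c^2), the one-sided formula."""
    return math.exp(-2.0 * c * c)

def bridge_sup_abs_tail(c, terms=100):
    """Pr(sup |y_t| > c) = 2 * sum_{j>=1} (-1)^(j-1) exp(-2 j^2 c^2), truncated."""
    return 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * c * c)
                     for j in range(1, terms + 1))

for c in (0.5, 1.0, 1.358, 2.0):
    print(c, bridge_sup_tail(c), bridge_sup_abs_tail(c))
```

The first term of the series is $2\exp(-2c^2)$, twice the one-sided tail, and for large $c$ it dominates the rest. The value $c = 1.358$ gives a two-sided tail close to 0.05.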

Doob (1949) asked whether the convergence in distribution of empirical 
processes to the Brownian bridges held for more general functionals (other 
than the supremum and that of absolute value). Donsker (1952) stated and 
proved (not quite correctly) a general extension. This book will present results 
proved over the past few decades by many researchers, first in this chapter on 
speed of convergence in the classical case. In the rest of the book, the collection 
of half-lines $(-\infty, x]$, $x \in \mathbb{R}$, will be replaced by much more general classes
of sets in, and functions on, general sample spaces, for example, the class of 
all ellipsoids in R?. 

To motivate and illustrate the general theory, the first section will give a 
revised formulation of Donsker’s theorem with a statement on rate of conver- 
gence, to be proved in Section 1.4. Sections 1.2 on metric entropy and 1.3 on 
inequalities provide concepts and facts to be used in the rest of the book. 


1.1 Empirical Processes: The Classical Case 


In this section, a form of Donsker’s theorem with rates of convergence will be
stated for the $U[0, 1]$ distribution with distribution function $U$ and empirical
distribution functions $U_n$. This would imply a corresponding limit theorem for
a general distribution function $F$ via Proposition 1.2. Let $\alpha_n := n^{1/2}(U_n - U)$
on $[0, 1]$. It will be proved that as $n \to \infty$, $\alpha_n$ converges in law (in a sense
to be made precise below) to a Brownian bridge process $y_t$, $0 \le t \le 1$ (RAP,
before Theorem 12.1.5).

Donsker in 1952 proved that the convergence in law of $\alpha_n$ to the Brownian
bridge holds, in a sense, with respect to uniform convergence in $t$ on the whole
interval $[0, 1]$. How to define such convergence in law correctly, however, was
not clarified until much later. General definitions will be given in Chapter 3.
Here, a more special approach will be taken in order to state and prove an
accessible form of Donsker’s theorem.

For a function $f$ on $[0, 1]$ we have the sup norm

$$\|f\|_{\sup} := \sup\{|f(t)| : 0 \le t \le 1\}.$$

Here is a formulation of Donsker’s theorem.

Theorem 1.7 For $n = 1, 2, \ldots$, there exist probability spaces $\Omega_n$ such that:

(a) On $\Omega_n$, there exist $n$ i.i.d. random variables $X_1, \ldots, X_n$ with uniform
distribution in $[0, 1]$. Let $\alpha_n$ be the $n$th empirical process based on these $X_i$;

(b) On $\Omega_n$ a sample-continuous Brownian bridge process $Y_n$: $(t, \omega) \mapsto Y_n(t, \omega)$
is defined;

(c) $\|\alpha_n - Y_n\|_{\sup}$ is measurable, and for all $\varepsilon > 0$, $\Pr(\|\alpha_n - Y_n\|_{\sup} > \varepsilon) \to 0$
as $n \to \infty$.

The theorem just stated is a consequence of the following facts giving rates
of convergence. Komlós, Major, and Tusnády (1975) stated a sharp rate of
convergence in Donsker’s theorem, namely, that on some probability space
there exist $X_i$ i.i.d. $U[0, 1]$ and Brownian bridges $Y_n$ such that

$$P\Bigl(\sup_{0 \le t \le 1} |(\alpha_n - Y_n)(t)| > \frac{x + c\log n}{\sqrt{n}}\Bigr) < K e^{-\lambda x} \tag{1.1}$$

for all $n = 1, 2, \ldots$ and $x > 0$, where $c$, $K$, and $\lambda$ are positive absolute constants.
If we take $x = a \log n$ for some $a > 0$, so that the numerator of the
fraction remains of the order $O(\log n)$, the right side becomes $K n^{-\lambda a}$, decreasing
as $n \to \infty$ as any desired negative power of $n$.

More specifically, Bretagnolle and Massart (1989) proved the following:


Theorem 1.8 (Bretagnolle and Massart) The approximation (1.1) of empirical
processes by Brownian bridges holds with $c = 12$, $K = 2$, and $\lambda = 1/6$
for $n \ge 2$.


Bretagnolle and Massart’s theorem is proved in Section 1.4. 
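
To get a feel for the sizes involved, the two sides of (1.1) can be evaluated with the constants of Theorem 1.8. This is only an arithmetic sketch; the function names and the particular values of $n$ and $a$ are arbitrary choices.

```python
import math

C_BM, K_BM, LAMBDA_BM = 12.0, 2.0, 1.0 / 6.0   # constants c, K, lambda of Theorem 1.8

def kmt_threshold(n, x):
    """Deviation level (x + c log n) / sqrt(n) on the left side of (1.1)."""
    return (x + C_BM * math.log(n)) / math.sqrt(n)

def kmt_probability_bound(x):
    """Right side K exp(-lambda x) of (1.1)."""
    return K_BM * math.exp(-LAMBDA_BM * x)

n = 10_000
a = 12.0                      # x = a log n turns the bound into K * n^(-lambda * a)
x = a * math.log(n)
print(kmt_threshold(n, x), kmt_probability_bound(x), K_BM * n ** (-LAMBDA_BM * a))
```

With these choices the probability bound equals $2n^{-2}$, while the deviation level is a constant multiple of $(\log n)/\sqrt{n}$, which is still above 2 at $n = 10^4$: the explicit constants make the statement rigorous for every $n \ge 2$ but only become numerically tight for large samples.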


1.2 Metric Entropy and Capacity 


The notions in this section will be applied in later chapters, first to Gaussian
processes, and then, in adapted forms (metric entropy with inclusion for sets, or
with bracketing for functions), to empirical processes.

The word “entropy” is applied to several concepts in mathematics. What 
they have in common is apparently that they give some measure of the size or 
complexity of some set or transformation and that their definitions involve loga- 
rithms. Beyond this rather superficial resemblance, there are major differences. 
What are here called “metric entropy” and “metric capacity” are measures 
of the size of a metric space, which must be totally bounded (have compact 
completion) in order for the metric entropy or capacity to be finite. Metric 
entropy will provide a useful general technique for dealing with classes of sets 
or functions in general spaces, as opposed to Markov (or martingale) methods. 
The latter methods apply, as in the last section, when the sample space is $\mathbb{R}$ and
the class $\mathcal{C}$ of sets is the class of half-lines $(-\infty, x]$, $x \in \mathbb{R}$, so that $\mathcal{C}$ with its
ordering by inclusion is isomorphic to $\mathbb{R}$ with its usual ordering.


Let $(S, d)$ be a metric space and $A$ a subset of $S$. Let $\varepsilon > 0$. A set $F \subset S$
(not necessarily included in $A$) is called an $\varepsilon$-net for $A$ if and only if for each
$x \in A$, there is a $y \in F$ with $d(x, y) \le \varepsilon$. Let $N(\varepsilon, A, S, d)$ denote the minimal
number of points in an $\varepsilon$-net in $S$ for $A$.

For any set $C \subset S$, define the diameter of $C$ by

$$\operatorname{diam} C := \sup\{d(x, y) : x, y \in C\}.$$

Let $N(\varepsilon, C, d)$ be the smallest $n$ such that $C$ is the union of $n$ sets of diameter
at most $2\varepsilon$. Let $D(\varepsilon, A, d)$ denote the largest $n$ such that there is a subset $F \subset A$
with $F$ having $n$ members and $d(x, y) > \varepsilon$ whenever $x \ne y$ for $x$ and $y$ in $F$.

The three quantities just defined are related by the following inequalities: 


Theorem 1.9 For any $\varepsilon > 0$ and set $A$ in a metric space $S$ with metric $d$,

$$D(2\varepsilon, A, d) \le N(\varepsilon, A, d) \le N(\varepsilon, A, S, d) \le N(\varepsilon, A, A, d) \le D(\varepsilon, A, d).$$


Proof. The first inequality holds since a set of diameter $2\varepsilon$ can contain at
most one of a set of points more than $2\varepsilon$ apart. The next holds because any
ball $B(x, \varepsilon) := \{y : d(x, y) \le \varepsilon\}$ is a set of diameter at most $2\varepsilon$. The third
inequality holds since requiring centers to be in $A$ is more restrictive. The last
holds because a set $F$ of points more than $\varepsilon$ apart, with maximal cardinality,
must be an $\varepsilon$-net, since otherwise there would be a point more than $\varepsilon$ away
from each point of $F$, which could be adjoined to $F$, a contradiction unless $F$
is infinite, but then the inequality holds trivially.
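
The argument for the last inequality is constructive enough to try out on a finite set. The sketch below greedily builds a set of points more than $\varepsilon$ apart to which no further point can be added, and then confirms that it is an $\varepsilon$-net. The grid $A$ and the sup metric are hypothetical choices. A greedy set is maximal under inclusion but need not have the largest possible cardinality, so its size is only a lower bound for $D(\varepsilon, A, d)$.

```python
def greedy_separated(points, eps, dist):
    """Greedily build a set F with d(x, y) > eps for distinct x, y in F,
    such that no further point of `points` can be added."""
    chosen = []
    for p in points:
        if all(dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen

def is_eps_net(net, points, eps, dist):
    """True if every point lies within distance eps of some element of the net."""
    return all(any(dist(p, q) <= eps for q in net) for p in points)

# Hypothetical finite set A: a 6 x 6 integer grid in the plane, with the sup metric.
A = [(i, j) for i in range(6) for j in range(6)]

def d(p, q):
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

eps = 1
F = greedy_separated(A, eps, d)
print(len(F), is_eps_net(F, A, eps, d))
```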


It follows that as $\varepsilon \downarrow 0$, when all the functions in the Theorem go to $\infty$ unless
$S$ is a finite set, they have the same asymptotic behavior up to a factor of 2
in $\varepsilon$. So it will be convenient to choose one of the four and make statements
about it, which will then yield corresponding results for the others. The choice
is somewhat arbitrary. Here are some considerations that bear on the choice.

The finite sets of points, whether more than $\varepsilon$ apart or forming an $\varepsilon$-net, are
often useful, as opposed to the sets in the definition of $N(\varepsilon, A, d)$. $N(\varepsilon, A, S, d)$
depends not only on $A$ but also on the larger space $S$. Many workers, possibly for
these reasons, have preferred $N(\varepsilon, A, A, d)$. But the latter may decrease when
the set $A$ increases. For example, let $A$ be the surface of a sphere of radius $\varepsilon$
around 0 in a Euclidean space $S$ and let $B := A \cup \{0\}$. Then $N(\varepsilon, B, B, d) =
1 < N(\varepsilon, A, A, d)$. This was the reason, apparently, that Kolmogorov chose to
use $N(\varepsilon, A, d)$.

In this book I adopt $D(\varepsilon, A, d)$ as basic. It depends only on $A$, not on the
larger space $S$, and is nondecreasing in $A$. If $D(\varepsilon, A, d) = n$, then there are $n$
points which are more than $\varepsilon$ apart and at the same time form an $\varepsilon$-net.

Now, the $\varepsilon$-entropy of the metric space $(A, d)$ is defined as $H(\varepsilon, A, d) :=
\log N(\varepsilon, A, d)$, and the $\varepsilon$-capacity as $\log D(\varepsilon, A, d)$. Some other authors take
logarithms to the base 2, by analogy with information-theoretic entropy. In this
book logarithms will be taken to the usual base $e$, which fits, for example, with
bounds coming from moment generating functions as in the next section, and
with Gaussian measures as in Chapter 2. There are a number of interesting sets
of functions where $N(\varepsilon, A, d)$ is of the order of magnitude $\exp(\varepsilon^{-r})$ as $\varepsilon \downarrow 0$,
for some power $r > 0$, so that the $\varepsilon$-entropy, and likewise the $\varepsilon$-capacity, have
the simpler order $\varepsilon^{-r}$. But in other cases below, $D(\varepsilon, A, d)$ is itself of the order
of a power of $1/\varepsilon$.
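
The simplest instance of the last case is the unit interval with its usual metric. An elementary computation, not taken from the text, gives $D(\varepsilon, [0, 1], d) = \lceil 1/\varepsilon \rceil$ (points more than $\varepsilon$ apart need $(n - 1)\varepsilon < 1$) and $N(\varepsilon, [0, 1], d) = \lceil 1/(2\varepsilon) \rceil$ (intervals of length $2\varepsilon$ covering $[0, 1]$). The sketch below tabulates these, with the corresponding entropy and capacity, using exact rationals to avoid rounding at the ceilings.

```python
import math
from fractions import Fraction

def D_interval(eps):
    """D(eps, [0,1], |.|): largest n with n points pairwise more than eps apart."""
    return math.ceil(1 / eps)

def N_interval(eps):
    """N(eps, [0,1], |.|): fewest sets of diameter at most 2*eps covering [0,1]."""
    return math.ceil(1 / (2 * eps))

for k in range(1, 6):
    eps = Fraction(1, 10 ** k)
    D, N = D_interval(eps), N_interval(eps)
    # entropy log N and capacity log D, natural logarithms as in the text
    print(eps, N, D, math.log(N), math.log(D))
```

Here both the entropy and the capacity grow like $\log(1/\varepsilon)$, in contrast with the order $\varepsilon^{-r}$ seen for richer classes of functions.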


1.3 Inequalities 


This section collects several inequalities bounding the probabilities that random
variables, and specifically sums of independent random variables, are
large. Many of these follow from a basic inequality of S. Bernštein and P. L.
Chebyshev:


Theorem 1.10 For any real random variable $X$ and $t \in \mathbb{R}$,

$$\Pr(X \ge t) \le \inf_{u \ge 0} e^{-tu} E e^{uX}.$$


Proof. For any fixed $u \ge 0$, the indicator function of the set where $X \ge t$
satisfies $1_{\{X \ge t\}} \le e^{u(X - t)}$, so the inequality holds for a fixed $u$; then take
$\inf_{u \ge 0}$.
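
For a standard normal variable $Z$ the moment generating function is $E e^{uZ} = e^{u^2/2}$, so the infimum in Theorem 1.10 is attained at $u = t$ and equals $\exp(-t^2/2)$. The sketch below approximates the infimum over a grid of $u$ (the grid is an arbitrary choice) and compares it with the exact tail.

```python
import math

def chernoff_bound(t, mgf, us):
    """min over the grid us of exp(-t*u) * E exp(uX): an upper bound for Pr(X >= t)."""
    return min(math.exp(-t * u) * mgf(u) for u in us)

def normal_mgf(u):
    """E exp(uZ) for Z with N(0, 1) distribution."""
    return math.exp(u * u / 2.0)

def normal_tail(t):
    """Exact Pr(Z >= t) for Z with N(0, 1) distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

us = [k / 100.0 for k in range(0, 1001)]          # grid of u in [0, 10]
for t in (0.5, 1.0, 2.0, 3.0):
    print(t, normal_tail(t), chernoff_bound(t, normal_mgf, us), math.exp(-t * t / 2.0))
```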

For any independent real random variables $X_1, \ldots, X_n$, let $S_n := X_1 + \cdots + X_n$.


Theorem 1.11 (Bernštein’s inequality) Let $X_1, X_2, \ldots, X_n$ be independent
real random variables with mean 0. Let $0 < M < \infty$ and suppose that $|X_j| \le
M$ almost surely for $j = 1, \ldots, n$. Let $\sigma_j^2 = \operatorname{var}(X_j)$ and $\tau_n^2 := \operatorname{var}(S_n) =
\sigma_1^2 + \cdots + \sigma_n^2$. Then for any $K > 0$,

$$\Pr\{|S_n| \ge K n^{1/2}\} \le 2 \cdot \exp\bigl(-nK^2/(2\tau_n^2 + 2Mn^{1/2}K/3)\bigr). \tag{1.2}$$


Proof. We can assume that $\tau_n^2 > 0$ since otherwise $S_n = 0$ a.s. and the inequality
holds. For any $u \ge 0$ and $j = 1, \ldots, n$,

$$E\exp(uX_j) = 1 + u^2\sigma_j^2 F_j/2 \le \exp(\sigma_j^2 F_j u^2/2) \tag{1.3}$$

where $F_j := 2\sigma_j^{-2} \sum_{r=2}^{\infty} u^{r-2} E X_j^r / r!$, or $F_j = 0$ if $\sigma_j^2 = 0$. For $r \ge
2$, $|X_j|^r \le X_j^2 M^{r-2}$ a.s., so $F_j \le 2\sum_{r=2}^{\infty} (Mu)^{r-2}/r! \le \sum_{r=2}^{\infty} (Mu/3)^{r-2} =
1/(1 - Mu/3)$ for all $j = 1, \ldots, n$ if $0 < u < 3/M$.

Let $v := K n^{1/2}$ and $u := v/(\tau_n^2 + Mv/3)$, so that $v = \tau_n^2 u/(1 - Mu/3)$.
Then $0 < u < 3/M$. Thus, multiplying the factors on the right side of (1.3) by
independence, we have

$$E\exp(uS_n) \le \exp\bigl(\tau_n^2 u^2/(2(1 - Mu/3))\bigr) = \exp(uv/2).$$

So by Theorem 1.10, $\Pr\{S_n \ge v\} \le e^{-uv/2}$ and

$$e^{-uv/2} = \exp\bigl(-v^2/(2\tau_n^2 + 2Mv/3)\bigr) = \exp\bigl(-nK^2/(2\tau_n^2 + 2MKn^{1/2}/3)\bigr).$$

The same bound holds for $-S_n$, which gives the factor 2 in (1.2).
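
As a sanity check on (1.2), one can compare the bound with an exactly computable case. The sketch below takes the $X_j$ to be i.i.d. with values $\pm 1$ with probability $1/2$ each, so that $M = 1$ and $\tau_n^2 = n$; this choice and the function names are illustrative assumptions.

```python
import math

def bernstein_bound(n, K, tau2, M):
    """Right side of (1.2): 2 exp(-n K^2 / (2 tau_n^2 + 2 M n^(1/2) K / 3))."""
    return 2.0 * math.exp(-n * K * K / (2.0 * tau2 + 2.0 * M * math.sqrt(n) * K / 3.0))

def sign_sum_two_sided_tail(n, v):
    """Exact Pr(|S_n| >= v) when Pr(X_j = 1) = Pr(X_j = -1) = 1/2."""
    count = sum(math.comb(n, h) for h in range(n + 1) if abs(2 * h - n) >= v)
    return count / 2 ** n

n, M = 100, 1.0
tau2 = float(n)            # each X_j has variance 1, so tau_n^2 = n
for K in (1.0, 2.0, 3.0):
    v = K * math.sqrt(n)
    print(K, sign_sum_two_sided_tail(n, v), bernstein_bound(n, K, tau2, M))
```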


Here are some remarks on Bernštein’s inequality. Note that for fixed $K$ and
$M$, if $X_j$ are i.i.d. with variance $\sigma^2$, then as $n \to \infty$, the bound approaches the
normal bound $2 \cdot \exp(-K^2/(2\sigma^2))$, as given in RAP, Lemma 12.1.6. Moreover,
this is true even if $M := M_n \to \infty$ as $n \to \infty$ while $K$ stays constant, provided
that $M_n/n^{1/2} \to 0$. Sometimes, the inequality can be applied to unbounded
variables $X_j$, replacing them by truncated ones, say replacing $X_j$ by $f_{M_n}(X_j)$
where $f_M(x) := x 1_{\{|x| \le M\}}$. In that case the probability

$$\Pr(|X_j| > M_n \text{ for some } j \le n) \le \sum_{j=1}^{n} \Pr(|X_j| > M_n)$$

needs to be small enough so that the inequality with this additional probability
added to the bound is still useful.

Next, let $s_1, s_2, \ldots$ be i.i.d. variables with $P(s_i = 1) = P(s_i = -1) = 1/2$.
Such variables are called “Rademacher” variables. We have the following
inequality:


Proposition 1.12 (Hoeffding) For any $t \ge 0$, and real $a_j$ not all 0,

$$\Pr\Bigl\{\sum_{j=1}^{n} a_j s_j \ge t\Bigr\} \le \exp\Bigl(-t^2 \Big/ \Bigl(2\sum_{j=1}^{n} a_j^2\Bigr)\Bigr).$$


Proof. Since $1/(2n)! \le 2^{-n}/n!$ for $n = 0, 1, \ldots$, we have $\cosh x \equiv (e^x +
e^{-x})/2 \le \exp(x^2/2)$ for all $x$. Applying Theorem 1.10, the probability on the
left is bounded above by $\inf_u \exp(-ut + \sum_{j=1}^{n} a_j^2 u^2/2)$, which by calculus is
attained at $u = t/\sum_{j=1}^{n} a_j^2$, and the result follows.
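
For a small number of terms the probability in Proposition 1.12 can be computed exactly by enumerating all sign patterns, which gives a direct comparison with the bound. The coefficients `a` below are hypothetical, and the code also spot-checks the inequality $\cosh x \le \exp(x^2/2)$ used in the proof.

```python
import math
from itertools import product

def rademacher_tail(a, t):
    """Exact Pr(sum_j a_j s_j >= t), by enumerating all 2^n sign patterns."""
    n = len(a)
    hits = sum(1 for signs in product((-1, 1), repeat=n)
               if sum(aj * sj for aj, sj in zip(a, signs)) >= t)
    return hits / 2 ** n

def hoeffding_bound(a, t):
    """exp(-t^2 / (2 sum_j a_j^2)), the bound of Proposition 1.12."""
    return math.exp(-t * t / (2.0 * sum(aj * aj for aj in a)))

a = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]     # hypothetical coefficients
for t in (0, 5, 10, 20, 30):
    print(t, rademacher_tail(a, t), hoeffding_bound(a, t))

# The elementary inequality behind the proof, checked on a grid of x.
cosh_ok = all(math.cosh(k / 10) <= math.exp((k / 10) ** 2 / 2) for k in range(-50, 51))
print(cosh_ok)
```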


Proposition 1.12 can be applied as follows. Let $Y_1, Y_2, \ldots$ be independent
variables which are symmetric, in other words $Y_j$ has the same distribution as
$-Y_j$ for all $j$. Let $s_j$ be Rademacher variables independent of each other and of
all the $Y_j$. Then the sequence $\{s_j Y_j\}_{j \ge 1}$ has the same distribution as $\{Y_j\}_{j \ge 1}$.
Thus to bound the probability that $\sum_{j=1}^{n} Y_j > K$, for example, we can consider
the conditional probability for each $Y_1, \ldots, Y_n$,

$$\Pr\Bigl\{\sum_{j=1}^{n} s_j Y_j > K \,\Big|\, Y_1, \ldots, Y_n\Bigr\} \le \exp\Bigl(-K^2 \Big/ \Bigl(2\sum_{j=1}^{n} Y_j^2\Bigr)\Bigr)$$

by Proposition 1.12. Then to bound the original probability, integrating over
the distribution of the $Y_j$, one just needs to have bounds on the distribution of
$\sum_{j=1}^{n} Y_j^2$, which may simplify the problem considerably.

The Bernštein inequality (Theorem 1.11) used variances as well as bounds
for centered variables. The following inequalities, also due to Hoeffding, use
only bounds. They are essentially the best that can be obtained, under their
hypotheses, by the moment generating function technique.


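Theorem 1.13 (Hoeffding) Let $X_1, \ldots, X_n$ be independent random variables
such that $0 \le X_j \le 1$ for all $j$. Let $\bar{X} := (X_1 + \cdots + X_n)/n$ and $\mu := E\bar{X}$.
Then for $0 < t < 1 - \mu$,

$$\Pr\{\bar{X} - \mu \ge t\} \le \Bigl\{\Bigl(\frac{\mu}{\mu + t}\Bigr)^{\mu + t}\Bigl(\frac{1 - \mu}{1 - \mu - t}\Bigr)^{1 - \mu - t}\Bigr\}^{n} \le e^{-nt^2 g(\mu)} \le e^{-2nt^2},$$

where

$$g(\mu) := (1 - 2\mu)^{-1}\log((1 - \mu)/\mu) \quad \text{for } 0 < \mu < 1/2,$$
$$g(\mu) := 1/(2\mu(1 - \mu)) \quad \text{for } 1/2 \le \mu < 1.$$

For all $t > 0$,

$$P(\bar{X} - \mu \ge t) \le e^{-2nt^2}. \tag{1.4}$$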


Remark. For t < 0, the given probability would generally be of the order of 
1/2 or larger, so no small bound for it would be expected. 

Proof. For any $v > 0$, the function $f(x) := e^{vx}$ is convex (its second derivative
is positive), so any chord lies above its graph (RAP, Section 6.3). Specifically, if
$0 \le x \le 1$, then $e^{vx} \le 1 - x + xe^{v}$. Taking expectations gives $E\exp(vX_j) \le
1 - \mu_j + \mu_j e^{v}$, where $\mu_j := EX_j$. (Note that the latter inequality becomes
an equation for a Bernoulli variable $X_j$, taking only the values 0, 1.) Let
$S_n := X_1 + \cdots + X_n$. Then

$$\Pr(\bar{X} - \mu \ge t) = \Pr(S_n - ES_n \ge nt) \le E\exp(v(S_n - ES_n - nt))$$
$$= e^{-vn(t + \mu)} \prod_{j=1}^{n} E\exp(vX_j) \le e^{-vn(t + \mu)} \prod_{j=1}^{n} (1 - \mu_j + \mu_j e^{v}).$$


To continue the proof, the following well-known inequality (e.g., RAP (5.1.6)
and p. 276) will be useful:


Lemma 1.14 For any nonnegative real numbers $t_1, \ldots, t_n$, the geometric mean
is less than or equal to the arithmetic mean, in other words,

$$(t_1 t_2 \cdots t_n)^{1/n} \le (t_1 + \cdots + t_n)/n.$$
Applying the Lemma gives

$$\Pr\{\bar{X} - \mu \ge t\} \le e^{-vn(t + \mu)}\Bigl[\frac{1}{n}\sum_{j=1}^{n} (1 - \mu_j + \mu_j e^{v})\Bigr]^{n}
\le e^{-vn(t + \mu)}\bigl[1 - \mu + \mu e^{v}\bigr]^{n}.$$

To find the minimum of this for $v > 0$, note that it becomes large as $v \to \infty$
since $t + \mu < 1$, while setting the derivative with respect to $v$ equal to 0
gives a single solution, where $1 - \mu + \mu e^{v} = \mu e^{v}/(t + \mu)$ and $e^{v} = 1 +
t/(\mu(1 - \mu - t))$. Substituting these values into the bounds gives the first,
most complicated bound in the statement of the theorem. This bound can be
written as $\Pr(\bar{X} - \mu \ge t) \le \exp(-nt^2 G(t, \mu))$, where

$$G(t, \mu) := \frac{\mu + t}{t^2}\log\Bigl(\frac{\mu + t}{\mu}\Bigr) + \frac{1 - \mu - t}{t^2}\log\Bigl(\frac{1 - \mu - t}{1 - \mu}\Bigr).$$
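The next step is to show that $\min_{0 < t < 1 - \mu} G(t, \mu) = g(\mu)$ as defined in the
statement of the theorem. For $0 < x < 1$ let

$$H(x) := \Bigl(1 - \frac{2}{x}\Bigr)\log(1 - x).$$

In $\partial G(t, \mu)/\partial t$, the terms not containing logarithms cancel, giving

$$t^2\,\frac{\partial G(t, \mu)}{\partial t} = \Bigl[1 - \frac{2}{t}(1 - \mu)\Bigr]\log\Bigl(1 - \frac{t}{1 - \mu}\Bigr)
- \Bigl[1 - \frac{2}{t}(\mu + t)\Bigr]\log\Bigl(1 - \frac{t}{\mu + t}\Bigr)$$
$$= H\Bigl(\frac{t}{1 - \mu}\Bigr) - H\Bigl(\frac{t}{\mu + t}\Bigr).$$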

To see that H is increasing in x for 0 < x < 1, take the Taylor series of 
log(1 — x) and multiply by 1 — 2 to get the Taylor series of H around 0, 
all whose coefficients are nonnegative. Only the first order term is 0. Thus 
dG/dt > 0 if and only if t/(1 — u) > t/(u + t), or equivalently t > 1 — 2u. 
So if u < 1/2, then G(t, u), for fixed u, has a minimum with respect to t > 0 
att = 1 — 2y, giving g(u) for that case as stated. Or if u > 1/2, then G(r, u) 
is increasing in t > 0, with lim;;9 G(t, u) = g(u) as stated for that case, using 
the first two terms of the Taylor series around t = 0 of each logarithm. 

To prove (1.4) for $0 < t < 1 - \mu$, it will be shown that the minimum of
$g(\mu)$ for $0 < \mu < 1$ is 2. For $\mu \ge 1/2$, $g$ is increasing (its denominator is
decreasing), and $g(1/2) = 2$. For $\mu < 1/2$, letting $w := 1 - 2\mu$ we get
$g(\mu) = \frac{1}{w} \log\big(\frac{1+w}{1-w}\big)$. From the Taylor series of $\log(1 + w)$ and $\log(1 - w)$ around
$w = 0$, we see that $g$ is increasing in $w$, and so decreasing in $\mu$, and converges to 2
as $\mu \to 1/2$. Thus $g$ has a minimum at $\mu = 1/2$, which is 2.

To prove (1.4) for $0 < t = 1 - \mu$, consider $0 < s < t$ and let $s \uparrow t$. For
$t > 1 - \mu$, $\Pr(\overline{X} - \mu \ge t) \le P(\overline{X} > 1) = 0 \le \exp(-2nt^2)$. So (1.4) is proved
for all $t > 0$ and so is the theorem.
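The chain of bounds just proved can be checked numerically. The sketch below is only an illustration: the function name `hoeffding_bounds` is hypothetical, and the form of g for mean at least 1/2, namely 1/(2 mu (1 - mu)), is an assumption taken from the way the bound (1.6) is derived later, since the statement of the theorem is not reproduced here.

```python
import math

def hoeffding_bounds(n, mu, t):
    """Three successive upper bounds for Pr(mean - mu >= t).

    Assumes 0 < mu < 1 and 0 < t < 1 - mu.
    """
    # Sharpest bound exp(-n t^2 G(t, mu)), written through its logarithm.
    log_first = -n * ((mu + t) * math.log((mu + t) / mu)
                      + (1 - mu - t) * math.log((1 - mu - t) / (1 - mu)))
    # g(mu): the minimum of G(t, mu) over t, in the two cases of the proof.
    # The second branch is an assumption (see the lead-in).
    if mu < 0.5:
        g = math.log((1 - mu) / mu) / (1 - 2 * mu)
    else:
        g = 1.0 / (2 * mu * (1 - mu))
    return (math.exp(log_first),
            math.exp(-n * t * t * g),
            math.exp(-2 * n * t * t))

if __name__ == "__main__":
    for mu in (0.1, 0.3, 0.5, 0.8):
        print(mu, hoeffding_bounds(50, mu, 0.1))
```

For mu below 1/2 the first two bounds coincide at t = 1 - 2 mu, which is where the proof locates the minimum of G.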
For the empirical measure $P_n$, if $A$ is a fixed measurable set, $nP_n(A)$ is
a binomial random variable, and in a multinomial distribution, each $n_j$ has a
binomial distribution. So we will have need of some inequalities for binomial
probabilities, defined by

\[ B(k,n,p) := \sum_{0 \le j \le k} \binom{n}{j} p^j q^{n-j}, \qquad q := 1 - p, \]
\[ E(k,n,p) := \sum_{k \le j \le n} \binom{n}{j} p^j q^{n-j}. \]

Here $k$ is usually, but not necessarily, an integer. Thus, in $n$ independent trials
with probability $p$ of success on each trial, so that $q$ is the probability of
failure, $B(k,n,p)$ is the probability of at most $k$ successes, and $E(k,n,p)$ is
the probability of at least $k$ successes.


Theorem 1.15 (Chernoff) We have

\[ E(k,n,p) \le \Big(\frac{np}{k}\Big)^k \Big(\frac{nq}{n-k}\Big)^{n-k} \quad \text{if } k \ge np, \tag{1.5} \]
\[ B(k,n,p) \le \exp(-(np-k)^2/(2npq)) \quad \text{if } k \le np \le n/2. \tag{1.6} \]


Proof. These facts follow directly from the Hoeffding inequality Theorem 1.13.
For (1.6), note that $B(k,n,p) = E(n-k,n,1-p)$ and apply the $g(\mu)$ case
with $\mu = 1 - p$.

If in (1.5) we set $x := nq/(n-k) \le e^{x-1}$ it follows that

\[ E(k,n,p) \le (np/k)^k e^{k-np} \quad \text{if } k \ge np. \tag{1.7} \]

The next inequality is for the special value $p = 1/2$:


Proposition 1.16 If $k \le n/2$, then $2^n B(k,n,1/2) \le (ne/k)^k$.


Proof. By (1.5) and symmetry, $B(k,n,1/2) \le (n/2)^n k^{-k} (n-k)^{k-n}$. Letting
$y := n/(n-k) \le e^{y-1}$ then gives the result.
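The binomial bounds (1.5), (1.6), (1.7) and Proposition 1.16 can be compared with exact tail probabilities for a small case. This is only a sketch; the function names are hypothetical and the exact tails are computed by direct summation.

```python
from math import comb, exp, e

def upper_tail(k, n, p):
    """E(k, n, p): probability of at least k successes (k need not be an integer)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(n + 1) if j >= k)

def lower_tail(k, n, p):
    """B(k, n, p): probability of at most k successes."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(n + 1) if j <= k)

def chernoff_upper(k, n, p):
    """Right side of (1.5); requires n*p <= k < n."""
    q = 1 - p
    return (n * p / k) ** k * (n * q / (n - k)) ** (n - k)

def simplified_upper(k, n, p):
    """Right side of (1.7)."""
    return (n * p / k) ** k * exp(k - n * p)

def chernoff_lower(k, n, p):
    """Right side of (1.6); requires k <= n*p <= n/2."""
    return exp(-((n * p - k) ** 2) / (2 * n * p * (1 - p)))

if __name__ == "__main__":
    n, p = 40, 0.25
    for k in (10, 15, 20, 30):
        print(k, upper_tail(k, n, p), chernoff_upper(k, n, p),
              simplified_upper(k, n, p))
```

The case k = n is left out of `chernoff_upper` because the factor with exponent n - k = 0 would need a separate convention.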


A form of Stirling’s formula with error bounds is: 


Theorem 1.17 For $n = 1, 2, \ldots$, $e^{1/(12n+1)} \le n!\,(e/n)^n (2\pi n)^{-1/2} \le e^{1/(12n)}$.


Proof. See Feller (1968), vol. 1, Section II.9, p. 54. 


For any real $x$ let $x^+ := \max(x, 0)$. A Poisson random variable $z$ with
parameter $m$ has the distribution given by $\Pr(z = k) = e^{-m} m^k/k!$ for each
nonnegative integer $k$.

Lemma 1.18 For any Poisson variable $z$ with parameter $m \ge 1$,
$E(z - m)^+ \ge m^{1/2}/8$.


Proof. We have $E(z - m)^+ = \sum_{k > m} e^{-m} m^k (k - m)/k!$. Let $j := [m]$, meaning
$j$ is the greatest integer with $j \le m$. Then by a telescoping sum (which is
absolutely convergent), $E(z - m)^+ = e^{-m} m^{j+1}/j!$. Then by Stirling's formula
with error bounds (Theorem 1.17),

\[ E(z - m)^+ \ge e^{-m} m^{j+1} (e/j)^j (2\pi j)^{-1/2} e^{-1/(12j)} \ge (m^{j+1}/j^{j+1/2})\, e^{-13/12} (2\pi)^{-1/2} \ge m^{1/2}/8. \]
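The telescoping identity and the lower bound of Lemma 1.18 are easy to test numerically. The following sketch (function names hypothetical, series truncated at an arbitrary number of terms) compares the direct sum for the expected positive part with the closed form and with the bound, the square root of m divided by 8.

```python
import math

def poisson_pos_part_series(m, terms=400):
    """E(z - m)^+ for Poisson z with parameter m, by direct (truncated) summation."""
    total = 0.0
    for k in range(math.floor(m) + 1, terms):
        # log of e^{-m} m^k / k!, to avoid overflow for large k
        log_pk = -m + k * math.log(m) - math.lgamma(k + 1)
        total += math.exp(log_pk) * (k - m)
    return total

def poisson_pos_part_closed(m):
    """Closed form e^{-m} m^{j+1} / j! with j the integer part of m."""
    j = math.floor(m)
    return math.exp(-m + (j + 1) * math.log(m) - math.lgamma(j + 1))

if __name__ == "__main__":
    for m in (1, 1.5, 2, 7.3, 30):
        print(m, poisson_pos_part_series(m), poisson_pos_part_closed(m),
              math.sqrt(m) / 8)
```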
In the following two facts, let $X_1, X_2, \ldots, X_n$ be independent random
variables with values in a separable normed space $S$ with norm $\|\cdot\|$. Let

\[ S_j := X_1 + \cdots + X_j \quad \text{for } j = 1, \ldots, n. \]


Theorem 1.19 (Ottaviani's inequality). If for some $\alpha > 0$ and $c$ with $0 < c < 1$,
we have $P(\|S_n - S_j\| > \alpha) \le c$ for all $j = 1, \ldots, n$, then

\[ P\{\max_{j \le n} \|S_j\| \ge 2\alpha\} \le P(\|S_n\| \ge \alpha)/(1 - c). \]


Proof. The proof in RAP, 9.7.2, for $S = \mathbb{R}^k$, works for any separable normed
$S$. Here $(x, y) \mapsto \|x - y\|$ is measurable: $S \times S \to \mathbb{R}$ by RAP, Proposition 4.1.7.

When the random variables $X_j$ are symmetric, there is a simpler inequality
due to P. Lévy:


Theorem 1.20 (Lévy's inequality) Given a probability space $(\Omega, P)$, let $Y$ be
a countable set, let $X_1, X_2, \ldots$, be stochastic processes defined on $\Omega$ indexed
by $Y$; in other words, for each $j$ and $y \in Y$, $X_j(y)(\cdot)$ is a random variable on
$\Omega$. For any bounded function $f$ on $Y$ let $\|f\|_Y := \sup\{|f(y)|: y \in Y\}$. Suppose
that the processes $X_j$ are independent with $\|X_j\|_Y < \infty$ a.s., and symmetric;
in other words, for each $j$, the random variables $\{-X_j(y): y \in Y\}$ have the
same joint distribution as $\{X_j(y): y \in Y\}$. Let $S_n := X_1 + \cdots + X_n$. Then for
each $n$, and $M > 0$,

\[ P\Big(\max_{j \le n} \|S_j\|_Y > M\Big) \le 2 P(\|S_n\|_Y > M). \]


Note. The norm on a separable Banach space $(X, \|\cdot\|)$ can always be written
in the form $\|\cdot\|_Y$ for $Y$ countable, via the Hahn-Banach theorem: apply RAP,
Corollary 6.1.5, to a countable dense set in the unit ball of $X$ to get a countable
subset $Y$ in the dual space $X'$ of $X$ which is norming, i.e. on $X$, $\|\cdot\| = \|\cdot\|_Y$.
Here $X'$ may not be separable. On the other hand, the above Lemma applies
to some nonseparable Banach spaces: the space of all bounded functions on an
infinite $Y$ with supremum norm is itself nonseparable.


Proof. Let $M_k(\omega) := \max_{j \le k} \|S_j\|_Y$. Let $C_k$ be the disjoint events
$\{M_{k-1} \le M < M_k\}$, $k = 1, 2, \ldots$, where we set $M_0 := 0$. Then for $1 \le m \le n$,
$2\|S_m\|_Y \le \|S_n\|_Y + \|2S_m - S_n\|_Y$. So if $\|S_m\|_Y > M$, then $\|S_n\|_Y > M$ or
$\|2S_m - S_n\|_Y > M$ or both. The transformation which interchanges $X_j$ and
$-X_j$ just for $m < j \le n$ preserves probabilities, by symmetry and independence.
Then $S_n$ is interchanged with $2S_m - S_n$, while $X_j$ are preserved for $j \le m$.
So $P(C_m \cap \{\|S_n\|_Y > M\}) = P(C_m \cap \{\|2S_m - S_n\|_Y > M\}) \ge P(C_m)/2$,
and

\[ P(M_n > M) = \sum_{m=1}^n P(C_m) \le 2 P(\|S_n\|_Y > M). \]
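For a concrete instance of Lévy's inequality, take the index set to be a single point and let the summands be independent fair signs, which are symmetric. Both probabilities can then be computed exactly by enumerating all sign sequences. The sketch below does this; the function name is hypothetical and the choice of fair signs is an assumption made purely for illustration.

```python
from itertools import product

def levy_probabilities(n, M):
    """Exact P(max_j |S_j| > M) and P(|S_n| > M) for n independent fair +-1 steps."""
    count_max = 0
    count_end = 0
    for signs in product((-1, 1), repeat=n):
        partial = 0
        running_max = 0
        for x in signs:
            partial += x
            running_max = max(running_max, abs(partial))
        count_max += running_max > M
        count_end += abs(partial) > M
    total = 2 ** n
    return count_max / total, count_end / total

if __name__ == "__main__":
    for M in range(6):
        p_max, p_end = levy_probabilities(10, M)
        print(M, p_max, 2 * p_end)
```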


1.4 *Proof of the Bretagnolle-Massart Theorem 


The proof is based essentially on some lemmas, which G. Tusnády stated, on
approximating symmetric binomial random variables by normal variables. Let
$B(n, 1/2)$ denote the symmetric binomial distribution for $n$ trials. Thus if $B_n$
has this distribution, $B_n$ is the number of successes in $n$ independent trials with
probability 1/2 of success on each trial. For any distribution function $F$, recall
Proposition 1.1 and the quantile function $F^{\leftarrow}$ from $(0, 1)$ into $\mathbb{R}$ defined and
used in it. If $F$ is continuous and strictly increasing on some interval $[a, b]$, with
$F(a) = 0$ and $F(b) = 1$, then $F^{\leftarrow} = F^{-1}$ from $(0, 1)$ onto $(a, b)$. In general,
whereas $F$ is right-continuous, $F^{\leftarrow}$ is always left-continuous.

Here is one of Tusnády's lemmas (Lemma 4 of Bretagnolle and Massart
1989). Its proof will be completed in Subsection 1.4.2.


Lemma 1.21 Let $\Phi$ be the standard normal distribution function and $Y$ a standard
normal random variable. Let $\Phi_n$ be the distribution function of $B(n, 1/2)$
and set $C_n := \Phi_n^{\leftarrow}(\Phi(Y)) - n/2$. Then

\[ |C_n| \le 1 + (\sqrt{n}/2)|Y|, \tag{1.8} \]
\[ |C_n - (\sqrt{n}/2)Y| \le 1 + Y^2/8. \tag{1.9} \]


By Proposition 1.1(c), if $V$ has a $U[0, 1]$ distribution, $F^{\leftarrow}(V)$ has distribution
function $F$. If $F$ is continuous, we have the following in the other direction:


Proposition 1.22 Let $X$ be a real random variable with distribution function
$F$. If $F$ is continuous, then $F(X)$ has a $U[0, 1]$ distribution.


Proof. For $0 < y < 1$, $F^{-1}(\{y\})$ is a closed interval $[a, b]$ by continuity. (It is
a singleton, $a = b$, usually, but not necessarily for all $y$.) We have
$P(F(X) = y) = P(a \le X \le b) = F(b) - F(a-)$ where $F(a-) = \lim_{u \uparrow a} F(u) = F(a)$
because $F$ is continuous. Now $F(b) - F(a) = y - y = 0$. Thus

\[ P(F(X) \le y) = P(F(X) < y) = 1 - P(F(X) \ge y) = 1 - P(X \ge F^{\leftarrow}(y)) \]
\[ = 1 - P(X \ge a) = P(X < a) = F(a-) = F(a) = y, \]

again by continuity of $F$, so indeed $F(X)$ has a $U[0, 1]$ distribution, completing
the proof.


Thus in Lemma 1.21, $\Phi(Y)$ has a $U[0, 1]$ distribution and $\Phi_n^{\leftarrow}(\Phi(Y))$ has
the symmetric binomial distribution $B(n, 1/2)$. Lemma 1.21 will be shown (by
a relatively short proof) to follow from:


Lemma 1.23 Let $Y$ be a standard normal variable and let $B_n$ be a binomial
random variable with distribution $B(n, 1/2)$. Then for any integer $j$ such that
$0 \le j \le n$ and $n + j$ is even, we have

\[ P(B_n \ge (n+j)/2) \ge P(\sqrt{n}\,Y/2 \ge n(1 - \sqrt{1 - j/n})), \tag{1.10} \]
\[ P(B_n \ge (n+j)/2) \le P(\sqrt{n}\,Y/2 \ge (j-2)/2). \tag{1.11} \]
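Both inequalities of Lemma 1.23 compare an exact symmetric binomial tail with a normal tail, so they can be checked directly for small n. The sketch below is one way to do that; the function names are hypothetical, and the normal tail is evaluated through the complementary error function.

```python
import math
from math import comb

def normal_upper_tail(x):
    """P(Y >= x) for a standard normal Y."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def binomial_upper_tail(n, k):
    """P(B_n >= k) for B_n with distribution B(n, 1/2)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

def lemma_1_23_holds(n):
    """Check (1.10) and (1.11) for every j in 0..n with n + j even."""
    for j in range(n + 1):
        if (n + j) % 2:
            continue
        tail = binomial_upper_tail(n, (n + j) // 2)
        # (1.10): sqrt(n) Y / 2 >= n (1 - sqrt(1 - j/n))
        lower = normal_upper_tail(2 * math.sqrt(n) * (1 - math.sqrt(1 - j / n)))
        # (1.11): sqrt(n) Y / 2 >= (j - 2) / 2
        upper = normal_upper_tail((j - 2) / math.sqrt(n))
        if not (lower <= tail <= upper):
            return False
    return True

if __name__ == "__main__":
    print(all(lemma_1_23_holds(n) for n in range(1, 41)))
```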


The following form of Stirling’s formula with remainder is used in the proof 
of Lemma 1.23. 


Lemma 1.24 Let $n! = (n/e)^n \sqrt{2\pi n}\, A_n$, where $A_n = 1 + \beta_n/(12n)$, which
defines $A_n$ and $\beta_n$ for $n = 1, 2, \ldots$. Then $\beta_n \downarrow 1$ as $n \to \infty$.


1.4.1 Stirling’s Formula: Proof of Lemma 1.24 


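It can be checked directly that $\beta_1 > \beta_2 > \cdots > \beta_8 > 1$. So it suffices to prove
the lemma for $n \ge 8$. We have $A_n = \exp((12n)^{-1} - \theta_n/(360n^3))$ where
$0 < \theta_n < 1$; see Whittaker and Watson (1927, p. 252) or Nanjundiah (1959). Then
by Taylor's theorem with remainder,

\[ A_n = \Big[1 + \frac{1}{12n} + \frac{1}{288n^2} + \frac{1}{6(12n)^3} \phi_n e^{1/(12n)}\Big] \exp(-\theta_n/(360n^3)) \]

where $0 < \phi_n < 1$. Next,

\[ \beta_{n+1} \le 12(n+1)\Big[\exp\Big(\frac{1}{12(n+1)}\Big) - 1\Big] \le 1 + \frac{1}{24(n+1)} + \frac{1}{6 \cdot 144 (n+1)^2} e^{1/(12(n+1))}, \]

from which $\limsup_{n \to \infty} \beta_n \le 1$, and

\[ \beta_n = 12n[A_n - 1] \ge 12n\Big[\Big(1 + \frac{1}{12n} + \frac{1}{288n^2}\Big) \exp(-1/(360n^3)) - 1\Big]. \]

Using $e^{-x} \ge 1 - x$ gives

\[ \beta_n \ge 12n\Big[\frac{1}{12n} + \frac{1}{288n^2} - \frac{1}{360n^3}\Big(1 + \frac{1}{12n} + \frac{1}{288n^2}\Big)\Big] = 1 + \frac{1}{24n} - \frac{1}{30n^2}\Big(1 + \frac{1}{12n} + \frac{1}{288n^2}\Big). \]

Thus $\liminf_{n \to \infty} \beta_n \ge 1$ and $\beta_n \to 1$ as $n \to \infty$. To prove $\beta_n \ge \beta_{n+1}$ for
$n \ge 8$ it will suffice to show that

\[ 1 + \frac{1}{24(n+1)} + \frac{e^{1/108}}{6 \cdot 144 n^2} \le 1 + \frac{1}{24n} - \frac{1}{30n^2}\Big[1 + \frac{1}{96} + \frac{1}{288 \cdot 8^2}\Big] \]

or

\[ \frac{e^{1/108}}{6 \cdot 144 n^2} + \frac{1}{30n^2}\Big[1 + \frac{1}{96} + \frac{1}{288 \cdot 8^2}\Big] \le \frac{1}{24n(n+1)}, \]

or that $0.035/n^2 \le 1/[24n(n+1)]$ or $0.84 \le 1 - 1/(n+1)$, which holds for
$n \ge 8$, proving that $\beta_n$ decreases with $n$. Since its limit is 1, Lemma 1.24 is
proved.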
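The correction factors in Lemma 1.24 can be computed directly from the defining relation between n! and Stirling's approximation. The sketch below (function name hypothetical) works with logarithms to stay accurate, and lets one confirm the first value, about 1.01325, and the monotone decrease toward 1 over a finite range.

```python
import math

def stirling_beta(n):
    """beta_n defined by n! = (n/e)^n sqrt(2 pi n) (1 + beta_n / (12 n))."""
    # log A_n = log n! - n log n + n - (1/2) log(2 pi n)
    log_a = (math.lgamma(n + 1) - n * math.log(n) + n
             - 0.5 * math.log(2 * math.pi * n))
    # expm1 keeps precision because A_n - 1 is small
    return 12 * n * math.expm1(log_a)

if __name__ == "__main__":
    for n in (1, 2, 8, 50, 100):
        print(n, stirling_beta(n))
```

This is a floating-point check over finitely many n, not a substitute for the proof.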


1.4.2 Proof of Lemma 1.23 
First, (1.10) will be proved. For any $i = 0, 1, \ldots, n$ such that $n + i$ is even,
let $k := (n+i)/2$ so that $k$ is an integer, $n/2 \le k \le n$, and $i = 2k - n$.
Let $p_{ni} := P(B_n = (n+i)/2) = P(B_n = k) = \binom{n}{k}/2^n$ and $x_i := i/n$. Define
$p_{ni} := 0$ for $n + i$ odd. The factorials in $\binom{n}{k}$ will be approximated via Stirling's
formula with correction terms as in Lemma 1.24. To that end, let

\[ CS(u, v, w, x, n) := \frac{1 + u/(12n)}{(1 + v/[6n(1-x)])(1 + w/[6n(1+x)])}. \]

By Lemma 1.24, we can write for $0 \le i < n$ and $n + i$ even

\[ p_{ni} = CS(x_i, n) \sqrt{2/(\pi n)} \exp(-n g(x_i)/2 - (1/2)\log(1 - x_i^2)) \tag{1.12} \]

where $g(x) := (1+x)\log(1+x) + (1-x)\log(1-x)$ and
$CS(x_i, n) := CS(\beta_n, \beta_{n-k}, \beta_k, x_i, n)$. By Lemma 1.24 and since $k \ge n/2$,

\[ 1 + \varepsilon := 1.013251 \ge 12(e(2\pi)^{-1/2} - 1) = \beta_1 \ge \beta_{n-k} \ge \beta_k \ge \beta_n > 1. \]
Thus, for $x := x_i$, by clear or easily checked monotonicity properties,

\[ CS(x, n) \le CS(\beta_n, \beta_k, \beta_k, x, n) = \Big(1 + \frac{\beta_n}{12n}\Big)\Big[1 + \frac{\beta_k}{3n(1-x^2)} + \frac{\beta_k^2}{36n^2(1-x^2)}\Big]^{-1} \]
\[ \le CS(\beta_n, \beta_k, \beta_k, 0, n) \le CS(\beta_n, \beta_n, \beta_n, 0, n) \]
\[ \le CS(1, 1, 1, 0, n) = \Big(1 + \frac{1}{12n}\Big)\Big(1 + \frac{1}{6n}\Big)^{-2}. \]

It will be shown next that $\log(1+y) - 2\log(1+2y) \le -3y + 7y^2/2$ for $y \ge 0$.
Both sides vanish for $y = 0$. Differentiating and clearing fractions, we get a
clearly true inequality. Setting $y := 1/(12n)$ then gives

\[ \log CS(x_i, n) \le -1/(4n) + 7/(288n^2). \tag{1.13} \]

To get a lower bound for $CS(x, n)$ we have by an analogous string of inequalities

\[ CS(x, n) \ge \Big(1 + \frac{1}{12n}\Big)\Big[1 + \frac{1+\varepsilon}{3n(1-x^2)} + \frac{(1+\varepsilon)^2}{36n^2(1-x^2)}\Big]^{-1}. \tag{1.14} \]


The inequality (1.10) to be proved can be written as

\[ \sum_{i=j}^{n} p_{ni} \ge 1 - \Phi(2\sqrt{n}(1 - \sqrt{1 - j/n})). \tag{1.15} \]

When $j = 0$ the result is clear. When $n \le 4$ and $j = n$ or $n - 2$ the result can
be checked from tables of the normal distribution. Thus we can assume from
here on

\[ n \ge 5. \tag{1.16} \]

Case 1. Let $j^2 \ge 2n$, in other words $x_j \ge \sqrt{2/n}$. Recall that for $t > 0$ we have
$P(Y > t) \le (t\sqrt{2\pi})^{-1} \exp(-t^2/2)$, e.g. Dudley (2002), Lemma 12.1.6(a).
Then (1.15) follows easily when $j = n$ and $n \ge 5$. To prove it for $j = n - 2$ it
is enough to show

\[ n(2 - \log 2) - 4\sqrt{2n} + \log(n+1) + 4 + \log[2\sqrt{2\pi}(\sqrt{n} - \sqrt{2})] \ge 0, \quad n \ge 5. \]

The left side is increasing in $n$ for $n \ge 5$ and is $\ge 0$ at $n = 5$.

For $5 \le n \le 7$ we have $(n-4)^2 < 2n$, so we can assume in the present case
that $2n \le j^2 \le (n-4)^2$ and $n \ge 8$. Let $y_i := 2\sqrt{n}(1 - \sqrt{1 - i/n})$. Then it
will suffice to show

\[ p_{ni} \ge \int_{y_i}^{y_{i+2}} \phi(u)\,du, \quad i = j, j+2, \ldots, n-4, \tag{1.17} \]

where $\phi$ is the standard normal density function. Let

\[ f_n(x) := \sqrt{n/(2\pi(1-x))}\, \exp(-2n(1 - \sqrt{1-x})^2). \tag{1.18} \]

By the change of variables $u = 2\sqrt{n}(1 - \sqrt{1-x})$, (1.17) becomes

\[ p_{ni} \ge \int_{x_i}^{x_{i+2}} f_n(x)\,dx. \tag{1.19} \]

Clearly $f_n > 0$. To see that $f_n(x)$ is decreasing in $x$ for $\sqrt{2/n} \le x \le 1 - 4/n$,
note that

\[ 2(1-x) f_n'/f_n = 1 - 4n[\sqrt{1-x} - 1 + x], \]

so $f_n$ is decreasing where $\sqrt{1-x} - (1-x) > 1/(4n)$. We have $\sqrt{y} - y \ge y$
for $y \le 1/4$, so $\sqrt{y} - y \ge 1/(4n)$ for $1/(4n) \le y \le 1/4$. Let $y := 1 - x$.
Also $\sqrt{1-x} - (1-x) > x/4$ for $x < 8/9$, so $\sqrt{1-x} - (1-x) > 1/(4n)$ for
$1/n \le x < 8/9$. Thus $\sqrt{1-x} - (1-x) > 1/(4n)$ for $1/n \le x < 1 - 1/(4n)$,
which includes the desired range. Thus to prove (1.19) it will be enough to
show that

\[ p_{ni} \ge (2/n) f_n(x_i), \quad i = j, j+2, \ldots, n-4. \tag{1.20} \]

So by (1.12) it will be enough to show that for $\sqrt{2/n} \le x \le 1 - 4/n$ and
$n \ge 8$,

\[ CS(x, n)(1+x)^{-1/2} \exp[n\{4(1 - \sqrt{1-x})^2 - g(x)\}/2] \ge 1. \tag{1.21} \]

Let

\[ J(x) := 4(1 - \sqrt{1-x})^2 - g(x). \tag{1.22} \]


Then $J$ is increasing for $0 < x < 1$, since its first and second derivatives are
both 0 at 0, while its third derivative is easily checked to be positive on $(0, 1)$.
In light of (1.14), to prove (1.21) it suffices to show that

\[ \Big(1 + \frac{1}{12n}\Big) e^{nJ(x)/2} \ge \sqrt{1+x}\,\Big(1 + \frac{1+\varepsilon}{3n(1-x^2)} + \frac{(1+\varepsilon)^2}{36n^2(1-x^2)}\Big). \tag{1.23} \]

When $x \le 1 - 4/n$ and $n \ge 8$ the right side is less than 1.5, using first
$\sqrt{1+x} \le \sqrt{2}$, next $x \le 1 - 4/n$, and lastly $n \ge 8$. For $x \ge 0.55$ and $n \ge 8$
the left side is larger than 1.57, so (1.23) is proved for $x \ge 0.55$. We will next
need the inequality

\[ J(x) \ge x^3/2 + 7x^4/48, \quad 0 \le x \le 0.55. \tag{1.24} \]

To check this one can calculate $J(0) = J'(0) = J''(0) = 0$, $J^{(3)}(0) = 3$,
$J^{(4)}(0) = 7/2$, so that the right side of (1.24) is the Taylor series of $J$ around
0 through fourth order. One then shows straightforwardly that $J^{(5)}(x) > 0$ for
$0 \le x < 1$.
It follows since $nx^2 \ge 2$ and $n \ge 8$ that $nJ(x)/2 \ge x/2 + 7/(24n)$. Let
$K(x) := \exp(x/2)/\sqrt{1+x}$ and $\kappa(x) := (K(x) - 1)/x^2$. We will next
see that $\kappa(\cdot)$ is decreasing on $[0, 1]$. To show $\kappa' \le 0$ is equivalent to
$e^{x/2}[4 + 4x - x^2] \ge 4(1+x)^{3/2}$, which is true at $x = 0$. Differentiating, we
would like to show $e^{x/2}[6 - x^2/2] \ge 6\sqrt{1+x}$, or squaring that and multiplying
by 4, $e^x(144 - 24x^2 + x^4) \ge 144(1+x)$. This is true at $x = 0$. Differentiating,
we would like to prove $e^x(144 - 48x - 24x^2 + 4x^3 + x^4) \ge 144$. Using
$e^x \ge 1 + x$ and algebra gives this result for $0 \le x \le 1$.

It follows that $K(x) \ge 1 + 0.3799/n$ when $\sqrt{2/n} \le x \le 0.55$. It remains
to show that for $x \le 0.55$,

\[ \Big(1 + \frac{1}{12n}\Big)\Big(1 + \frac{0.3799}{n}\Big) e^{7/(24n)} \ge 1 + \frac{1+\varepsilon}{3n(1-x^2)} + \frac{(1+\varepsilon)^2}{36n^2(1-x^2)}. \]

At $x = 0.55$ the right side is less than $1 + 0.543/n$, so Case 1 is completed
since $0.543 < 1/12 + 0.3799 + 7/24$.


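Case 2. The remaining case is $j < \sqrt{2n}$. For any integer $k$, $P(B_n \ge k) =
1 - P(B_n \le k - 1)$. For $k = (n+j)/2$ we have $k - 1 = (n+j-2)/2$. If $n$ is
odd, then $P(B_n \ge n/2) = 1/2 = P(Y \ge 0)$. If $n$ is even, then
$P(B_n \ge n/2) - p_{n0}/2 = 1/2 = P(Y \ge 0)$. So, since $p_{n0} = 0$ for $n$ odd, (1.10) is equivalent to

\[ \frac{1}{2} p_{n0} + \sum_{0 < i \le j-2} p_{ni} \le P(0 \le Y \le 2\sqrt{n}(1 - \sqrt{1 - j/n})). \tag{1.25} \]

Given $j < \sqrt{2n}$, a family $I_0, I_1, \ldots, I_K$ of adjacent intervals will be defined
such that for $n$ odd,

\[ p_{ni} \le P(\sqrt{n}\,Y/2 \in I_k) \quad \text{with } i = 2k+1,\ 0 \le k \le K := (j-3)/2, \tag{1.26} \]

while for $n$ even,

\[ p_{ni} \le P(\sqrt{n}\,Y/2 \in I_k) \quad \text{with } i = 2k,\ 1 \le k \le K := (j-2)/2, \tag{1.27} \]

and

\[ p_{n0}/2 \le P(\sqrt{n}\,Y/2 \in I_0). \tag{1.28} \]

In either case,

\[ I_0 \cup I_1 \cup \cdots \cup I_K \subset [0, n(1 - \sqrt{1 - j/n})]. \tag{1.29} \]

The intervals will be defined by

\[ \delta_{k+1} := (k+1)/n + k(k+1/2)(k+1)/n^{3/2}, \quad k \ge 0, \tag{1.30} \]
\[ \Delta_{k+1} := \delta_{k+1} + k + 1/2 = \delta_{k+1} + (i+1)/2, \quad i = 2k,\ n \text{ even}, \tag{1.31} \]
\[ \Delta_{k+1} := \delta_{k+1} + k + 1 = \delta_{k+1} + (i+1)/2, \quad i = 2k+1,\ n \text{ odd}, \tag{1.32} \]
\[ I_k := [\Delta_k, \Delta_{k+1}] \quad \text{with } \Delta_0 = 0. \tag{1.33} \]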
It will be shown that $I_0, I_1, \ldots, I_K$ defined by (1.30) through (1.33) satisfy
(1.26) through (1.29). Recall that $n \ge 5$ (1.16) and $x_i := i/n$.


Proof of (1.29). It needs to be shown that $\Delta_{K+1} \le n(1 - \sqrt{1 - x_j})$. Since
$j < \sqrt{2n}$, we have $K \le j/2 - 1 < \sqrt{n/2} - 1$ and

\[ \delta_{K+1} \le (K+1)/n + K(K+1/2)/(n\sqrt{2}) \le x_j/2 + n x_j^2/(4\sqrt{2}). \]

We have $\Delta_{K+1} = n x_j/2 - 1/2 + \delta_{K+1}$. It will be shown next that

\[ 1 - \sqrt{1-x} \ge x/2 + x^2/8, \quad 0 \le x \le 1. \tag{1.34} \]

The functions and their first derivatives agree at 0 while the second derivative
of the left side is clearly larger.
It then remains to prove that

\[ 1/2 + n x_j^2 (1/8 - 1/(4\sqrt{2})) - x_j/2 \ge 0. \]

This is true since $n x_j^2 \le 2$ and $x_j \le (2/8)^{1/2} = 1/2$, so (1.29) is proved.
Proof of (1.26)-(1.28). First it will be proved that

\[ p_{ni} \le \sqrt{\frac{2}{\pi n}} \exp\Big[-\frac{1}{4n} + \frac{7}{288n^2} - \frac{(n-1)i^2}{2n^2} + \frac{(i/n)^{2n}}{2n(1 - i^2/n^2)}\Big]. \tag{1.35} \]

In light of (1.12) and (1.13), it is enough to prove, for $x := i/n$, that

\[ -[n g(x) + \log(1 - x^2) - (n-1)x^2]/2 \le x^{2n}/(2n(1 - x^2)). \tag{1.36} \]

It is easy to verify that for $0 \le t < 1$,

\[ g(t) = (1+t)\log(1+t) + (1-t)\log(1-t) = \sum_{r=1}^{\infty} \frac{t^{2r}}{r(2r-1)}. \]

Thus the left side of (1.36) can be expanded as
$\sum_{r \ge 2} x^{2r}(1 - n/(2r-1))/(2r) = A + B$ where $A = \sum_{r=2}^{n-1}$ and $B = \sum_{r \ge n}$. We have

\[ d^2A/dx^2 = \sum_{2 \le r \le (n+1)/2} (2r - n - 1)(x^{2r-2} - x^{2n-2r}), \]

which is $\le 0$ for $0 \le x \le 1$. Since $A = dA/dx = 0$ for $x = 0$ we have $A \le 0$
for $0 \le x \le 1$. Then, $2nB \le x^{2n}/(1 - x^2)$, so (1.35) is proved.
We have for $n \ge 5$ and $x \le (\sqrt{2n} - 2)/n$ that $x^{2n}/(1 - x^2) \le 10^{-3}$, since
$n \mapsto (\sqrt{2n} - 2)/n$ is decreasing in $n$ for $n \ge 8$ and the statement can be
checked for $n = 5, 6, 7, 8$. So (1.35) yields

\[ p_{ni} \le \sqrt{2/(\pi n)}\, \exp[-0.249/n + 7/(288n^2) - (n-1)i^2/(2n^2)]. \tag{1.37} \]


Next we will need: 
Lemma 1.25 For any $0 \le a < b$ and a standard normal variable $Y$,

\[ P(Y \in [a, b]) \ge \sqrt{1/(2\pi)}\,(b - a) \exp[-a^2/4 - b^2/4]\, \varphi(a, b) \tag{1.38} \]

where $\varphi(a, b) := [4/(b^2 - a^2)] \sinh[(b^2 - a^2)/4] \ge 1$.


Proof. Since the Taylor series of sinh around 0 has all coefficients positive,
and $(\sinh u)/u$ is an even function, clearly $\sinh u/u \ge 1$ for any real $u$. The
conclusion of the lemma is equivalent to

\[ \frac{a+b}{2} \int_a^b \exp(-u^2/2)\,du \ge \exp(-a^2/2) - \exp(-b^2/2). \tag{1.39} \]

Letting $x := b - a$ and $v := u - a$ we need to prove

\[ \Big(a + \frac{x}{2}\Big) \int_0^x \exp(-av - v^2/2)\,dv \ge 1 - \exp(-ax - x^2/2). \]

This holds for $x = 0$. Taking derivatives of both sides and simplifying, we
would like to show

\[ \int_0^x \exp(-av - v^2/2)\,dv \ge x \exp(-ax - x^2/2). \]

This also holds for $x = 0$, and differentiating both sides leads to a clearly true
inequality, so Lemma 1.25 is proved.
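The lower bound of Lemma 1.25 can be compared with the exact normal probability of an interval. The sketch below uses hypothetical function names; the exact probability comes from the complementary error function, and the bound is the right side of (1.38) with the factor sinh(u)/u at u = (b^2 - a^2)/4.

```python
import math

def normal_interval_prob(a, b):
    """P(a <= Y <= b) for a standard normal Y."""
    root2 = math.sqrt(2.0)
    return 0.5 * (math.erfc(a / root2) - math.erfc(b / root2))

def interval_lower_bound(a, b):
    """Right side of (1.38), for 0 <= a < b."""
    u = (b * b - a * a) / 4.0
    sinh_factor = math.sinh(u) / u          # at least 1
    return ((b - a) / math.sqrt(2 * math.pi)
            * math.exp(-(a * a + b * b) / 4.0) * sinh_factor)

if __name__ == "__main__":
    for a, b in ((0.0, 0.5), (1.0, 1.5), (2.0, 4.0)):
        print(a, b, normal_interval_prob(a, b), interval_lower_bound(a, b))
```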


For the intervals $I_k$, Lemma 1.25 yields

\[ P(\sqrt{n}\,Y/2 \in I_k) \ge \sqrt{2/(\pi n)}\, \varphi_k \exp[-(\Delta_{k+1}^2 + \Delta_k^2)/n + \log(\Delta_{k+1} - \Delta_k)] \tag{1.40} \]

where $\varphi_k := \varphi(2\Delta_k/\sqrt{n}, 2\Delta_{k+1}/\sqrt{n})$. The aim is to show that the ratio of the
bounds (1.40) over (1.37) is at least 1.
First consider the case $k = 0$. If $n$ is even, this means we want to prove
(1.28). Using (1.37) and (1.40) and $\varphi_0 \ge 1$, it suffices to show that

\[ 0.249/n - 7/(288n^2) - 1/(4n) - 1/n^2 - 1/n^3 + \log(1 + 2/n) \ge 0. \]

Since $\log(1+u) \ge u - u^2/2$ for $u \ge 0$ by taking a derivative, it will be enough
to show that

\[ (E)_n := 1.999/n - 3/n^2 - 7/(288n^2) - 1/n^3 \ge 0, \]

and it is easily checked that $n(E)_n > 0$ since $n \ge 5$.
If $n$ is odd, then (1.37) applies for $i = 2k + 1 = 1$, and we have $\Delta_0 = 0$,
$\Delta_1 = \delta_1 + 1 = 1 + 1/n$, so (1.40) yields

\[ P(\sqrt{n}\,Y/2 \in I_0) \ge \sqrt{2/(\pi n)}\, \exp[-(1 + 1/n)^2/n + \log(1 + 1/n)]. \]

Using $\log(1+u) \ge u - u^2/2$ again, the desired inequality can be checked
since $n \ge 5$. This completes the case $k = 0$.
Now suppose $k \ge 1$. In this case, $i < \sqrt{2n} - 2$ implies $n \ge 10$ for $n$ even
and $n \ge 13$ for $n$ odd. Let $s_k := \delta_k + \delta_{k+1}$ and $d_k := \delta_{k+1} - \delta_k$. Then for $i$
as in the definition of $\Delta_{k+1}$,

\[ \Delta_{k+1} + \Delta_k = i + s_k, \tag{1.41} \]
\[ \Delta_{k+1} - \Delta_k = 1 + d_k, \tag{1.42} \]
\[ s_k = \frac{2k+1}{n} + \frac{2k^3 + k}{n^{3/2}}, \tag{1.43} \]

and

\[ d_k = \frac{1}{n} + \frac{3k^2}{n^{3/2}}. \tag{1.44} \]

From the Taylor series of sinh around 0 one easily sees that $(\sinh u)/u \ge
1 + u^2/6$ for all $u$. Letting $u := (\Delta_{k+1}^2 - \Delta_k^2)/n \ge i/n$ gives

\[ \log \varphi_k \ge \log(1 + i^2/(6n^2)). \tag{1.45} \]

We have

\[ d_k \le 3/(2\sqrt{n}) \tag{1.46} \]

since $2k \le \sqrt{2n} - 2$ and $n \ge 10$. Next we have another lemma:


Lemma 1.26 $\log(1 + x) \ge \lambda x$ for $0 \le x \le \alpha$ for each of the pairs $(\alpha, \lambda) =
(0.207, 0.9), (0.195, 0.913), (0.14, 0.93), (0.04, 0.98)$.


Proof. Since $x \mapsto \log(1 + x)$ is concave, or equivalently we are proving
$1 + x \ge e^{\lambda x}$ where the latter function is convex, it suffices to check the inequalities
at the endpoints, where they hold.
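The four endpoint checks in Lemma 1.26 are a matter of arithmetic. The sketch below (names hypothetical) evaluates both sides at the right endpoint of each interval, and also samples the interior as a sanity check on the concavity argument.

```python
import math

# (alpha, lambda) pairs as listed in the lemma
PAIRS = [(0.207, 0.9), (0.195, 0.913), (0.14, 0.93), (0.04, 0.98)]

def endpoint_margin(alpha, lam):
    """log(1 + alpha) - lam * alpha; nonnegative when the endpoint check passes."""
    return math.log1p(alpha) - lam * alpha

def holds_on_grid(alpha, lam, steps=1000):
    """Check log(1 + x) >= lam * x on an evenly spaced grid in [0, alpha]."""
    return all(math.log1p(alpha * i / steps) >= lam * (alpha * i / steps)
               for i in range(steps + 1))

if __name__ == "__main__":
    for alpha, lam in PAIRS:
        print(alpha, lam, endpoint_margin(alpha, lam), holds_on_grid(alpha, lam))
```

The margins are small (of order 1e-4 or less for two of the pairs), so the constants leave little room.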


Lemma 1.26 and (1.45) then give

\[ \log \varphi_k \ge 0.98\, i^2/(6n^2) \tag{1.47} \]

since $i^2/(6n^2) \le 1/(3n) \le 0.04$, $n \ge 10$. Next,


Lemma 1.27 We have $\log(\Delta_{k+1} - \Delta_k) \ge \lambda d_k$ where $\lambda = 0.9$ when $n$ is even
and $n \ge 20$, $\lambda = 0.93$ when $n$ is odd and $n \ge 25$, and $\lambda = 0.913$ when $k = 1$
and $n \ge 10$. Only these cases are possible (for $k \ge 1$).


Proof. If $n$ is even and $k \ge 2$, then $4 \le i = 2k < \sqrt{2n} - 2$ implies $n \ge 20$. If
$n$ is odd and $k \ge 2$, then $5 \le i = 2k + 1 < \sqrt{2n} - 2$ implies $n \ge 25$. So only
the given cases are possible.

We have $k \le k_n := \sqrt{n/2} - 1$ for $n$ even or $k_n := \sqrt{n/2} - 3/2$ for $n$
odd. Let $d(n) := 1/n + 3k_n^2/n^{3/2}$ and $t := 1/\sqrt{n}$. It will be shown that
$d(n)$ is decreasing in $n$, separately for $n$ even and odd. For $n$ even we would
like to show that $3t/2 + (1 - 3\sqrt{2})t^2 + 3t^3$ is increasing for $0 \le t \le 1/\sqrt{20}$,
and in fact its derivative is $> 0.04$. For $n$ odd we would like to show that
$3t/2 + (1 - 9/\sqrt{2})t^2 + 27t^3/4$ is increasing. We find that its derivative has no
real roots and so is always positive as desired.

Since $d(\cdot)$ is decreasing for $n \ge 20$, its maximum for $n$ even, $n \ge 20$ is
at $n = 20$, and we find it is less than 0.207, so Lemma 1.26 applies to give
$\lambda = 0.9$. Similarly for $n$ odd and $n \ge 25$ we have the maximum $d(25) < 0.14$,
and Lemma 1.26 applies to give $\lambda = 0.93$.

If $k = 1$, then $n \mapsto n^{-1} + 3/n^{3/2}$ is clearly decreasing. Its value at $n = 10$
is less than 0.195 and Lemma 1.26 applies with $\lambda = 0.913$. So Lemma 1.27 is
proved.


It will next be shown that for $n \ge 10$

\[ s_k \le n^{-1} + k/\sqrt{n}. \tag{1.48} \]

By (1.43) this is equivalent to $2/\sqrt{n} + (2k^2 + 1)/n \le 1$. Since $k \le \sqrt{n/2} - 1$
one can check that (1.48) holds for $n \ge 14$. For $n = 10, 11, 12, 13$ note that $k$
is an integer; in fact $k \le 1$, and (1.48) holds.

After some calculations, letting $s := s_k$ and $d := d_k$ and noting that

\[ \Delta_k^2 + \Delta_{k+1}^2 = \tfrac{1}{2}\big[(\Delta_{k+1} - \Delta_k)^2 + (\Delta_k + \Delta_{k+1})^2\big], \]

to show that the ratio of (1.40) to (1.37) is at least 1 is equivalent to showing
that

\[ -\frac{is}{n} - \frac{d}{n} - \frac{s^2}{2n} - \frac{d^2}{2n} - \frac{1}{2n} - \frac{7}{288n^2} - \frac{i^2}{2n^2} + \frac{0.249}{n} + \log(1+d) + \log \varphi_k \ge 0. \tag{1.49} \]


Proof of (1.49). First suppose that $n$ is even and $n \ge 20$ or $n$ is odd and $n \ge 25$.
Apply the bound (1.46) for $d^2/(2n)$, (1.47) for $\log \varphi_k$, (1.48) for $s$, and Lemma
1.27 for $\log(1 + d)$. Apply the exact value (1.44) of $d$ in the $d/n$ and $\lambda d$ terms.
We assemble together terms with factors $k^2$, $k$, and no factor of $k$, getting a
lower bound $A$ for (1.49) of the form

\[ A := \alpha[k^2/n^{3/2}] - 2\beta[k/n^{5/4}] + \gamma[1/n] \tag{1.50} \]

where, if $n$ is even, so $i = 2k$ and $\lambda = 0.9$, we get

\[ \alpha = 0.7 - [2.5 - 2(0.98)/3]/\sqrt{n} - 3/n, \]
\[ \beta = n^{-3/4} + n^{-5/4}/2, \]
\[ \gamma = 0.649 - [17/8 + 7/288]/n - 1/(2n^2). \]

Note that for each fixed $n$, $A$ is $1/n$ times a quadratic in $k/n^{1/4}$. Also, $\alpha$ and
$\gamma$ are increasing in $n$ while $\beta$ is decreasing. Thus for $n \ge 20$ the supremum of
$\beta^2 - \alpha\gamma$ is attained at $n = 20$ where it is $< -0.06$. So the quadratic has no
real roots, and since $\alpha > 0$ it is always positive, thus (1.49) holds.

When $n$ is odd, $i = 2k + 1$, $\lambda = 0.93$, and $n \ge 25$. We get a lower bound $A$
for (1.49) of the same form (1.50) where now

\[ \alpha = 0.79 - [2.5 - 2(0.98)/3]/\sqrt{n} - 3/n, \]
\[ \beta = 1/(2n^{1/4}) + 2(1 - 0.98/6)/n^{3/4} + 1/(2n^{5/4}), \]
\[ \gamma = 0.679 - (3.625 + 7/288 - 0.98/6)/n - 1/(2n^2). \]

For the same reasons, the supremum of $\beta^2 - \alpha\gamma$ for $n \ge 25$ is now attained
at $n = 25$ and is negative (less than $-0.015$), so the conclusion (1.49) again
holds.
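The two discriminant values quoted in the proof of (1.49) can be recomputed. In the sketch below the coefficient formulas are those of the text, except that the even-case coefficient of the linear term is taken to be n^(-3/4) + n^(-5/4)/2; that form is an assumption, obtained by collecting the terms linear in k from the bounds (1.46) through (1.48), and it reproduces the stated value below -0.06 at n = 20. Function names are hypothetical.

```python
def discriminant_even(n):
    """beta^2 - alpha*gamma for n even (lambda = 0.9, i = 2k)."""
    alpha = 0.7 - (2.5 - 2 * 0.98 / 3) / n ** 0.5 - 3 / n
    beta = n ** -0.75 + 0.5 * n ** -1.25      # assumed form, see lead-in
    gamma = 0.649 - (17 / 8 + 7 / 288) / n - 1 / (2 * n * n)
    return beta * beta - alpha * gamma

def discriminant_odd(n):
    """beta^2 - alpha*gamma for n odd (lambda = 0.93, i = 2k + 1)."""
    alpha = 0.79 - (2.5 - 2 * 0.98 / 3) / n ** 0.5 - 3 / n
    beta = (0.5 * n ** -0.25 + 2 * (1 - 0.98 / 6) * n ** -0.75
            + 0.5 * n ** -1.25)
    gamma = 0.679 - (3.625 + 7 / 288 - 0.98 / 6) / n - 1 / (2 * n * n)
    return beta * beta - alpha * gamma

if __name__ == "__main__":
    print(discriminant_even(20), discriminant_odd(25))
```

A negative discriminant with a positive leading coefficient means the quadratic in k / n^(1/4) has no real roots and stays positive.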

It remains to consider the case $k = 1$ where $n$ is even and $n \ge 10$ or $n$
is odd and $n \ge 13$. Here instead of bounds for $s_k$ and $d_k$ we use the exact
values (1.43) and (1.44) for $k = 1$. We still use the bounds (1.47) for $\log \varphi_k$
and Lemma 1.27 for $\log(1 + d)$. When $n$ is even, $i = 2k = 2$, and we obtain
a lower bound $A'$ for (1.49) of the form $a_1/n + a_2/n^{3/2} + \cdots$. All terms $n^{-2}$
and beyond have negative coefficients. Applying the inequality
$-n^{-(3/2)-\alpha} \ge -n^{-3/2} \cdot 10^{-\alpha}$ for $n \ge 10$ and $\alpha = 1/2, 1, \ldots$, I found a lower bound
$A' \ge 0.662/n - 1.115/n^{3/2} > 0$ for $n \ge 10$. The same method for $n$ odd gave
$A' \ge 0.662/n - 1.998/n^{3/2} > 0$ for $n \ge 13$. The proof of (1.10) is complete.


Proof of (1.11). For $n$ odd, (1.11) is clear when $j = 1$, so we can assume $j \ge 3$.
For $n$ even, (1.11) is clear when $j = 2$. We next consider the case $j = 0$. By
symmetry we need to prove that $p_{n0} \le P(\sqrt{n}|Y|/2 \le 1)$. This can be checked
from a normal table for $n = 2$. For $n \ge 4$ we have $p_{n0} \le \sqrt{2/(\pi n)}$ by (1.37).
The integral of the standard normal density from $-2/\sqrt{n}$ to $2/\sqrt{n}$ is clearly
larger than the length of the interval times the density at the endpoints, namely
$2\sqrt{2/(\pi n)}\, \exp(-2/n)$. Since $\exp(-2/n) \ge 1/2$ for $n \ge 4$, the proof for $n$ even
and $j = 0$ is done.

We are left with the cases $j \ge 3$. For $j = n$, we have $p_{nn} = 2^{-n}$ and can
check the conclusion for $n = 3, 4$ from a normal table. Let $\phi$ be the standard
normal density. We have the inequality, for $t > 0$,

\[ P(Y \ge t) \ge \psi(t) := \phi(t)[t^{-1} - t^{-3}] \tag{1.51} \]

(Feller 1968, p. 175). Feller does not give a proof. For completeness, here is
one:

\[ \psi(t) = -\int_t^{\infty} \psi'(x)\,dx = \int_t^{\infty} \phi(x)(1 - 3x^{-4})\,dx \le P(Y > t). \]


To prove (1.11) via (1.51) for $j = n \ge 5$ we need to prove

\[ 1/2^n \le \phi(t_n)\, t_n^{-1} (1 - t_n^{-2}) \]

where $t_n := (n-2)/\sqrt{n}$. Clearly $n \mapsto t_n$ is increasing. For $n \ge 5$ we have
$1 - t_n^{-2} \ge 4/9$ and $(2\pi)^{-1/2} e^{2 - 2/n} \cdot 4/9 \ge 0.878$. Thus it suffices to prove

\[ n(\log 2 - 0.5) + 0.5\log n - \log(n-2) + \log(0.878) \ge 0, \quad n \ge 5. \]

This can be checked for $n = 5, 6$ and the left side is increasing in $n$ for $n \ge 6$,
so (1.11) for $j = n \ge 5$ follows.
So it will suffice to prove $p_{ni} \le P(\sqrt{n}\,Y/2 \in [(i-2)/2, i/2])$ for $j \le i < n$.
From (1.35) and Lemma 1.25, and the bound $\varphi_k \ge 1$, it will suffice to prove,
for $x := i/n$,

\[ -\frac{1}{4n} + \frac{7}{288n^2} - \frac{(n-1)x^2}{2} + \frac{x^{2n}}{2n(1 - x^2)} \le -\frac{n[(x - 2/n)^2 + x^2]}{4} \]

where $3/n \le x \le 1 - 2/n$. Note that $2n(1 - x^2) \ge 4$. Thus it is enough to
prove that

\[ x - x^2/2 - x^{2n}/4 \ge 3/(4n) + 7/(288n^2) \]

for $3/n \le x \le 1$ and $n \ge 5$, which holds since the function on the left is
concave, and the inequality holds at the endpoints. Thus (1.11) and Lemma
1.23 are proved.


1.4.3 Proof of Lemma 1.21 

Let $G(x)$ be the distribution function of a normal random variable $Z$ with
mean $n/2$ and variance $n/4$ (the same mean and variance as for $B(n, 1/2)$). Let
$B(k, n, 1/2) := \sum_{0 \le i \le k} \binom{n}{i} 2^{-n}$. Lemma 1.23 directly implies

\[ G(\sqrt{2kn} - n/2) \le B(k, n, 1/2) \le G(k+1) \quad \text{for } k \le n/2. \tag{1.52} \]

Specifically, letting $k := (n - j)/2$, (1.11) implies

\[ B(k, n, 1/2) \le P(Z \ge n - k - 1) = P(k + 1 \ge n - Z) = G(k+1) \]

since $n - Z$ has the same distribution as $Z$. Then (1.10) implies

\[ B(k, n, 1/2) \ge P\Big(\frac{n}{2} + \frac{\sqrt{n}}{2} Y \le -\frac{n}{2} + \sqrt{2kn}\Big) = G(\sqrt{2kn} - n/2). \]

Let

\[ \eta := \Phi_n^{\leftarrow}(G(Z)). \tag{1.53} \]

This definition of $\eta$ from $Z$ is called a quantile transformation. By Propositions
1.22 and 1.1(c) respectively, $G(Z)$ has a $U[0, 1]$ distribution and $\eta$ a $B(n, 1/2)$
distribution. It will be shown that

\[ Z - 1 \le \eta \le Z + (Z - n/2)^2/(2n) + 1 \quad \text{if } Z \le n/2 \tag{1.54} \]

and

\[ Z - (Z - n/2)^2/(2n) - 1 \le \eta \le Z + 1 \quad \text{if } Z \ge n/2. \tag{1.55} \]


Define a sequence of extended real numbers $-\infty = c_{-1} < c_0 < c_1 < \cdots <
c_n = +\infty$ by $G(c_k) = B(k, n, 1/2)$. Then one can check that $\eta = k$ on the event
$A_k := \{\omega: c_{k-1} < Z(\omega) \le c_k\}$. By (1.52), $G(c_k) = B(k, n, 1/2) \le G(k+1)$
for $k \le n/2$. So, on the set $A_k$ for $k \le n/2$ we have $Z - 1 \le c_k - 1 \le k = \eta$.
Note that for $n$ even, $n/2 < c_{n/2}$, while for $n$ odd, $n/2 = c_{(n-1)/2}$. So the left
side of (1.54) is proved.

If $Y$ is a standard normal random variable with distribution function $\Phi$
and density $\phi$, then $\Phi(-x) \le \phi(x)/x$ for $x > 0$, e.g. Dudley (2002), Lemma
12.1.6(a). So we have

\[ P(Z \le -n/2) = P\Big(\frac{n}{2} + \frac{\sqrt{n}}{2} Y \le -\frac{n}{2}\Big) = P\Big(\frac{\sqrt{n}}{2} Y \le -n\Big) = \Phi(-2\sqrt{n}) \le \frac{e^{-2n}}{2\sqrt{2\pi n}} \le \frac{1}{2^n}. \]

So $G(-n/2) \le G(c_0) = 2^{-n}$ and $-n/2 \le c_0$. Thus if $Z \le -n/2$, then $\eta = 0$.
Next note that $Z + (Z - n/2)^2/(2n) = (Z + n/2)^2/(2n) \ge 0$ always. Thus the
right side of (1.54) holds when $Z \le -n/2$ and whenever $\eta = 0$. Now assume
that $Z \ge -n/2$. By (1.52), for $1 \le k \le n/2$

\[ G((2(k-1)n)^{1/2} - n/2) \le B(k-1, n, 1/2) = G(c_{k-1}), \]

from which it follows that $(2(k-1)n)^{1/2} - n/2 \le c_{k-1}$ and

\[ k - 1 \le (c_{k-1} + n/2)^2/(2n). \tag{1.56} \]

The function $x \mapsto (x + n/2)^2$ is clearly increasing for $x \ge -n/2$ and thus for
$x \ge c_0$. Applying (1.56) we get on the set $A_k$ for $1 \le k \le n/2$

\[ \eta = k \le (Z + n/2)^2/(2n) + 1 = Z + (Z - n/2)^2/(2n) + 1. \]

Since $P(Z \le n/2) = 1/2 \le P(\eta \le n/2)$, and $\eta$ is a nondecreasing function of
$Z$, $Z \le n/2$ implies $\eta \le n/2$. So (1.54) is proved.

It will be shown next that $(\eta, Z)$ has the same joint distribution as
$(n - \eta, n - Z)$. It is clear that $\eta$ and $n - \eta$ have the same distribution and
that $Z$ and $n - Z$ do. We have for each $k = 0, 1, \ldots, n$, $n - \eta = k$ if and
only if $\eta = n - k$ if and only if $c_{n-k-1} < Z \le c_{n-k}$. We need to show that this
is equivalent to $c_{k-1} \le n - Z < c_k$, in other words $n - c_k < Z \le n - c_{k-1}$.
Thus we want to show that $c_{n-k-1} = n - c_k$ for each $k$. It is easy to check
that $G(n - c_k) = P(Z \ge c_k) = 1 - G(c_k)$ while $G(c_k) = B(k, n, 1/2)$ and
$G(c_{n-k-1}) = B(n - k - 1, n, 1/2) = 1 - B(k, n, 1/2)$. The statement about
joint distributions follows. Thus (1.54) implies (1.55).

Some elementary algebra, (1.54) and (1.55) imply

\[ |\eta - Z| \le 1 + (Z - n/2)^2/(2n) \tag{1.57} \]

and since $Z \le n/2$ implies $\eta \le n/2$ and $Z \ge n/2$ implies $\eta \ge n/2$,

\[ |\eta - n/2| \le 1 + |Z - n/2|. \tag{1.58} \]

Letting $Z = (n + \sqrt{n}\,Y)/2$ and noting that then $G(Z) = \Phi(Y)$, (1.53), (1.57),
and (1.58) imply Lemma 1.21 with $C_n = \eta - n/2$.
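The quantile transformation and the two inequalities of Lemma 1.21 can be illustrated numerically. The sketch below assumes the usual left-continuous quantile function, that is, the least k whose cumulative binomial probability is at least u; the function names are hypothetical and a small tolerance absorbs rounding in the normal distribution function.

```python
import math
from math import comb

def std_normal_cdf(y):
    return 0.5 * math.erfc(-y / math.sqrt(2.0))

def binomial_quantile(n, u):
    """Least k with P(B_n <= k) >= u for B_n ~ B(n, 1/2), 0 < u < 1 (assumed definition)."""
    cumulative = 0
    for k in range(n + 1):
        cumulative += comb(n, k)
        if cumulative >= u * 2 ** n:
            return k
    return n

def coupling_slacks(n, y):
    """Slack in (1.8) and in (1.9) at the point y; both should be nonnegative."""
    c_n = binomial_quantile(n, std_normal_cdf(y)) - n / 2
    half_root = math.sqrt(n) / 2
    slack_18 = 1 + half_root * abs(y) - abs(c_n)
    slack_19 = 1 + y * y / 8 - abs(c_n - half_root * y)
    return slack_18, slack_19

if __name__ == "__main__":
    for y in (-3.0, -1.0, 0.0, 0.5, 2.0):
        print(y, coupling_slacks(25, y))
```

The grid of y values is kept within a few standard deviations so that floating-point evaluation of the normal distribution function stays reliable.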


1.4.4 Inequalities for the Separate Processes 


We will need facts providing a modulus of continuity for the Brownian bridge
and something similar for the empirical process (although it is discontinuous).
Let $h(t) := +\infty$ if $t \le -1$ and

\[ h(t) := (1+t)\log(1+t) - t, \quad t > -1. \tag{1.59} \]

Here is a consequence of the Chernoff inequality (1.5) for binomial probabilities
$E(k, n, p)$, in which $k$ is not necessarily an integer.


Lemma 1.28 (Bennett) For any $k \ge np$ and $q := 1 - p$ we have

\[ E(k, n, p) \le \exp\Big(-\frac{np}{q}\, h\Big(\frac{k - np}{np}\Big)\Big). \tag{1.60} \]


Proof. According to Chernoff's inequality (1.5), we have

\[ E(k, n, p) \le (np/k)^k (nq/(n-k))^{n-k} \quad \text{if } k \ge np. \]


Taking logarithms, multiplying by $q$ and simplifying it will suffice to show that
for $k \ge np$

\[ (n-k)\,q \log\Big(\frac{nq}{n-k}\Big) \le kp \log\Big(\frac{np}{k}\Big) + k - np. \]

Taking $k$ as a continuous variable as we may, this becomes an equality for
$k = np$, so it suffices to show that the given inequality holds for $k > np$ if both
sides are replaced by their derivatives with respect to $k$. Again both sides are
equal when $k = np$. Differentiating again, the inequality reduces to $np \le k$
which is true.
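Bennett's bound (1.60) sits above the Chernoff bound (1.5), which in turn sits above the exact binomial tail. The sketch below (function names hypothetical) evaluates all three for one choice of n and p so the ordering can be seen.

```python
import math
from math import comb

def h(t):
    """h(t) = (1 + t) log(1 + t) - t for t > -1, +infinity otherwise, as in (1.59)."""
    if t <= -1:
        return math.inf
    return (1 + t) * math.log1p(t) - t

def upper_tail(k, n, p):
    """E(k, n, p): probability of at least k successes."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(n + 1) if j >= k)

def chernoff_bound(k, n, p):
    """Right side of (1.5); requires n*p <= k < n."""
    q = 1 - p
    return (n * p / k) ** k * (n * q / (n - k)) ** (n - k)

def bennett_bound(k, n, p):
    """Right side of (1.60); requires k >= n*p."""
    q = 1 - p
    return math.exp(-(n * p / q) * h((k - n * p) / (n * p)))

if __name__ == "__main__":
    n, p = 40, 0.25
    for k in (10, 12.5, 20, 30):
        print(k, upper_tail(k, n, p), chernoff_bound(k, n, p),
              bennett_bound(k, n, p))
```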


The previous lemma extends via martingales to a bound for the uniform 
empirical process on intervals (Lemma 2 of Bretagnolle and Massart 1989). 
Lemma 1.29 For any $b$ with $0 < b \le 1/2$ and $x > 0$,

\[ P\Big(\sup_{0 \le t \le b} |\alpha_n(t)| > x/\sqrt{n}\Big) \le 2\exp\Big(-\frac{nb}{1-b}\, h\Big(\frac{x(1-b)}{nb}\Big)\Big) \le 2\exp(-nb(1-b)\,h(x/(nb))). \tag{1.61} \]


Proof. From the binomial conditional distributions of multinomial variables we
have for $0 \le s \le t < 1$

\[ E(F_n(t) \mid F_n(u),\ u \le s) = E(F_n(t) \mid F_n(s)) = F_n(s) + \frac{t-s}{1-s}(1 - F_n(s)) = \frac{t-s}{1-s} + \frac{1-t}{1-s} F_n(s), \]

from which it follows directly that

\[ E\Big(\frac{F_n(t) - t}{1-t} \,\Big|\, F_n(u),\ u \le s\Big) = \frac{F_n(s) - s}{1-s}, \]

in other words, the process $(F_n(t) - t)/(1 - t)$, $0 \le t < 1$ is a martingale in
$t$ (here $n$ is fixed). Thus, $\alpha_n(t)/(1 - t)$, $0 \le t < 1$, is also a martingale, and
for any real $s$ the process $\exp(s\alpha_n(t)/(1 - t))$ is a submartingale, e.g. Dudley
(2002, 10.3.3(b)). Then

\[ P\Big(\sup_{0 \le t \le b} \alpha_n(t) > x/\sqrt{n}\Big) \le P\Big(\sup_{0 \le t \le b} \alpha_n(t)/(1-t) > x/\sqrt{n}\Big), \]

which for any $s > 0$ equals

\[ P\Big(\sup_{0 \le t \le b} \exp(s\alpha_n(t)/(1-t)) > \exp(sx/\sqrt{n})\Big). \]

By Doob's inequality (e.g. Dudley 2002, 10.4.2, for a finite sequence increasing
up to a dense set) the latter probability is

\[ \le \inf_{s > 0} \exp(-sx/\sqrt{n})\, E\exp(s\alpha_n(b)/(1-b)) \le \exp\Big(-\frac{nb}{1-b}\, h\Big(\frac{x(1-b)}{nb}\Big)\Big) \]

by Lemma 1.28, (1.60). In the same way, by (1.6) we get

\[ P\Big(\sup_{0 \le t \le b} (-\alpha_n(t)) > x/\sqrt{n}\Big) \le \exp(-x^2(1-b)/(2nb)). \tag{1.62} \]

It is easy to check that $h(u) \le u^2/2$ for $u \ge 0$, so the first inequality in Lemma
1.29 follows. It is easily shown by derivatives that $h(qy) \ge q^2 h(y)$ for $y \ge 0$
and $0 \le q \le 1$. For $q = 1 - b$, the bound in (1.61) then follows.


We next have a corresponding inequality for the Brownian bridge. 
Lemma 1.30 Let $B(t)$, $0 \le t \le 1$, be a Brownian bridge, $0 < b < 1$ and $x > 0$.
Let $\Phi$ be the standard normal distribution function. Then

\[ P\Big(\sup_{0 \le t \le b} B(t) > x\Big) = 1 - \Phi\big(x/\sqrt{b(1-b)}\big) + \exp(-2x^2)\Big(1 - \Phi\Big(\frac{(1-2b)x}{\sqrt{b(1-b)}}\Big)\Big). \tag{1.63} \]

If $0 < b \le 1/2$, then for all $x > 0$,

\[ P\Big(\sup_{0 \le t \le b} B(t) > x\Big) \le \exp(-x^2/(2b(1-b))). \tag{1.64} \]

Proof. Let $X(t)$, $0 \le t < \infty$ be a Wiener process. For some real $\alpha$ and value
of $X(1)$ let $\beta := X(1) - \alpha$. It will be shown that for any real $\alpha$ and $y$

\[ P\Big\{\sup_{0 \le t \le 1} X(t) - \alpha t > y \,\Big|\, X(1)\Big\} = 1_{\{\beta > y\}} + \exp(-2y(y - \beta))\, 1_{\{\beta \le y\}}. \tag{1.65} \]

Clearly, if $\beta > y$ then $\sup_{0 \le t \le 1} X(t) - \alpha t > y$ (let $t = 1$). Suppose $\beta \le y$.
One can apply a reflection argument as in the proof of Dudley (2002, Proposition
12.3.3), where details are given on making such an argument rigorous.
Let $X(t) = B(t) + tX(1)$ for $0 \le t \le 1$, where $B(\cdot)$ is a Brownian
bridge. We want to find $P(\sup_{0 \le t \le 1} B(t) + \beta t > y)$. But this is the same as
$P(\sup_{0 \le t \le 1} Y(t) > y \mid Y(1) = \beta)$ for a Wiener process $Y$. For $\beta \le y$, the probability
that $\sup_{0 \le t \le 1} Y(t) > y$ and $\beta \le Y(1) \le \beta + dy$ is the same by reflection
as $P(2y - \beta \le Y(1) \le 2y - \beta + dy)$. Thus the desired conditional probability,
for the standard normal density $\phi$, is $\phi(2y - \beta)/\phi(\beta) = \exp(-2y(y - \beta))$
as stated. So (1.65) is proved.

We can write the Brownian bridge $B$ as $W(t) - tW(1)$, $0 \le t \le 1$, for a Wiener process $W$. Let $W_1(t) := b^{-1/2} W(bt)$, $0 \le t < \infty$. Then $W_1$ is a Wiener process. Let $\eta := W(1) - W(b)$. Then $\eta$ has a normal $N(0, 1 - b)$ distribution and is independent of $W_1(t)$, $0 \le t \le 1$. Let $\gamma := ((1 - b)W_1(1) - \sqrt{b}\,\eta)\sqrt{b}/x$. We have

$$P\Big(\sup_{0\le t\le b} B(t) > x \,\Big|\, \eta, W_1(1)\Big) = P\Big(\sup_{0\le t\le 1} \big(W_1(t) - (bW_1(1) + \sqrt{b}\,\eta)t\big) > x/\sqrt{b} \,\Big|\, \eta, W_1(1)\Big).$$

Now the process $W_1(t) - (bW_1(1) + \sqrt{b}\,\eta)t$, $0 \le t \le 1$, has the same distribution as a Wiener process $Y(t)$, $0 \le t \le 1$, given that $Y(1) = (1 - b)W_1(1) - \sqrt{b}\,\eta$. Thus by (1.65) with $\alpha = 0$,

$$P\Big(\sup_{0\le t\le b} B(t) > x \,\Big|\, \eta, W_1(1)\Big) = 1_{\{\gamma > 1\}} + 1_{\{\gamma \le 1\}} \exp(-2x^2(1 - \gamma)/b). \tag{1.66}$$

Thus, integrating gives

$$P\Big(\sup_{0\le t\le b} B(t) > x\Big) = P(\gamma > 1) + \exp(-2x^2/b)\,E\big(\exp(2x^2\gamma/b)\,1_{\{\gamma \le 1\}}\big).$$

From the definition of $\gamma$ it has a $N(0, b(1 - b)/x^2)$ distribution. Since $x$ is constant, the latter integral with respect to $\gamma$ can be evaluated by completing the square in the exponent and yields (1.63).

We next need the inequality, for $x \ge 0$,

$$1 - \Phi(x) \le \tfrac{1}{2}\exp(-x^2/2). \tag{1.67}$$

This is easy to check via the first derivative for $0 \le x \le \sqrt{2/\pi}$. On the other hand we have the inequality $1 - \Phi(x) \le \phi(x)/x$, $x > 0$, e.g. Dudley (2002, 12.1.6(a)), which gives the conclusion for $x \ge \sqrt{2/\pi}$.

Applying (1.67) to both terms of (1.63) gives (1.64), so the Lemma is proved.
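As a numerical illustration of Lemma 1.30, the sketch below evaluates the right side of (1.63) and compares it with the bound (1.64) on a grid. The function names are hypothetical, and the normal upper tail is computed through `math.erfc` to avoid cancellation for large arguments.

```python
import math


def upper_tail(z: float) -> float:
    """1 - Phi(z) for the standard normal distribution function Phi."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def bridge_sup_tail(x: float, b: float) -> float:
    """Right side of (1.63): P(sup_{0<=t<=b} B(t) > x), with 0 < b < 1."""
    r = math.sqrt(b * (1.0 - b))
    return upper_tail(x / r) + math.exp(-2.0 * x * x) * upper_tail((1.0 - 2.0 * b) * x / r)


def bridge_sup_bound(x: float, b: float) -> float:
    """Right side of (1.64), valid for 0 < b <= 1/2."""
    return math.exp(-x * x / (2.0 * b * (1.0 - b)))


def bound_holds_on_grid() -> bool:
    """Check (1.64) against (1.63) for a few b <= 1/2 and x in (0, 3]."""
    for b in (0.05, 0.1, 0.25, 0.4, 0.5):
        for i in range(1, 31):
            x = 0.1 * i
            if bridge_sup_tail(x, b) > bridge_sup_bound(x, b) * (1.0 + 1e-9):
                return False
    return True
```

As $b \uparrow 1$ the value of `bridge_sup_tail(x, b)` approaches $\exp(-2x^2)$, the one-sided tail of the supremum of the bridge over all of $[0, 1]$.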


1.4.5 Proof of Theorem 1.8 


For the Brownian bridge $B(t)$, $0 \le t \le 1$, it is well known that for any $x > 0$

$$P\Big(\sup_{0\le t\le 1} |B(t)| \ge x\Big) \le 2\exp(-2x^2),$$

e.g. Dudley (2002, Proposition 12.3.3). It follows that

$$P\Big(\sqrt{n}\sup_{0\le t\le 1} |B(t)| \ge u\Big) \le 2\exp(-u/3)$$

for $u \ge n/6$. We also have $|\alpha_1(t)| \le 1$ for all $t$ and

$$P\Big(\sup_{0\le t\le 1} |\alpha_n(t)| \ge x\Big) \le D\exp(-2x^2), \tag{1.68}$$

which is the Dvoretzky-Kiefer-Wolfowitz inequality (Dvoretzky, Kiefer, and Wolfowitz 1956) with a constant $D$. Massart (1990) proved (1.68) with the sharp constant $D = 2$. Massart's theorem will be proved in Section 1.5. Earlier Hu (1985) proved it with $D = 4\sqrt{2}$. $D = 6$ would suffice for present purposes. Given $D$, it follows that for $u \ge n/6$,

$$P\Big(\sqrt{n}\sup_{0\le t\le 1} |\alpha_n(t)| \ge u\Big) \le D\exp(-u/3).$$

For $x < 6\log 2$, we have $2e^{-x/6} > 1$ so the conclusion of Theorem 1.8 holds. For $x > n/3 - 12\log n$, $u := (x + 12\log n)/2 > n/6$, so the left side of (1.1) is bounded above by $(2 + D)n^{-2}e^{-x/6}$. We have $(2 + D)n^{-2} \le 2$ for $n \ge 2$ and $D \le 6$.

Thus it will be enough to prove Theorem 1.8 when

$$6\log 2 \le x \le n/3 - 12\log n. \tag{1.69}$$

The function $t \mapsto t/3 - 12\log t$ is decreasing for $t < 36$, increasing for $t > 36$. Thus one can check that for (1.69) to be non-vacuous is equivalent to

$$n \ge 204. \tag{1.70}$$
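The threshold in (1.70) can be confirmed by direct search. This is a small sketch with hypothetical function names; it simply tests whether the interval in (1.69) is nonempty for each $n$.

```python
import math


def range_is_nonempty(n: int) -> bool:
    """True when 6 log 2 <= n/3 - 12 log n, i.e. (1.69) admits some x."""
    return 6.0 * math.log(2.0) <= n / 3.0 - 12.0 * math.log(n)


def smallest_admissible_n(limit: int = 10_000) -> int:
    """Smallest n in 1..limit for which (1.69) is non-vacuous."""
    for n in range(1, limit + 1):
        if range_is_nonempty(n):
            return n
    raise ValueError("no admissible n up to limit")
```

Because $t/3 - 12\log t$ is increasing for $t > 36$, the condition stays true for every $n$ beyond the first one found.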


Let $N$ be the largest integer such that $2^N \le n$, so that $\nu := 2^N \le n < 2\nu$. Let $Z$ be a $\nu$-dimensional normal random variable with independent components, each having mean 0 and variance $\zeta := n/\nu$. For integers $0 \le i < m$ let $A(i, m) := \{i + 1, \ldots, m\}$. For any two vectors $a := (a_1, \ldots, a_\nu)$ and $b := (b_1, \ldots, b_\nu)$ in $\mathbb{R}^\nu$, we have the usual inner product $(a, b) := \sum_{i=1}^{\nu} a_i b_i$. For any subset $D \subset A(0, \nu)$ let $1_D$ be its indicator function as a member of $\mathbb{R}^\nu$. For any integers $j = 0, 1, 2, \ldots$ and $k = 0, 1, \ldots$, let

$$I_{j,k} := A(2^j k,\ 2^j(k + 1)), \tag{1.71}$$

let $e_{j,k}$ be the indicator function of $I_{j,k}$ and for $j \ge 1$, let $e'_{j,k} := e_{j-1,2k} - e_{j,k}/2$. Then one can easily check that the family $\mathcal{E} := \{e'_{j,k} : 1 \le j \le N,\ 0 \le k < 2^{N-j}\} \cup \{e_{N,0}\}$ is an orthogonal basis of $\mathbb{R}^\nu$ with $(e_{N,0}, e_{N,0}) = \nu$ and $(e'_{j,k}, e'_{j,k}) = 2^{j-2}$ for each of the given $j, k$. Let $W_{j,k} := (Z, e_{j,k})$ and $W'_{j,k} := (Z, e'_{j,k})$. Then since the elements of $\mathcal{E}$ are orthogonal, it follows that the random variables $W'_{j,k}$ for $1 \le j \le N$, $0 \le k < 2^{N-j}$ and $W_{N,0}$ are independent normal with

$$EW'_{j,k} = EW_{N,0} = 0, \quad \mathrm{Var}(W'_{j,k}) = \zeta 2^{j-2}, \quad \mathrm{Var}(W_{N,0}) = \zeta\nu. \tag{1.72}$$
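The orthogonality claims for the family $\mathcal{E}$ can be verified exactly for small $N$. The sketch below uses rational arithmetic; the helper names are hypothetical, and vectors are indexed $1, \ldots, \nu$ as in the definition of $A(i, m)$.

```python
from fractions import Fraction


def indicator(j: int, k: int, nu: int) -> list:
    """e_{j,k}: indicator of I_{j,k} = {2^j k + 1, ..., 2^j (k + 1)} in R^nu."""
    lo, hi = (2 ** j) * k, (2 ** j) * (k + 1)
    return [Fraction(1) if lo < i <= hi else Fraction(0) for i in range(1, nu + 1)]


def e_prime(j: int, k: int, nu: int) -> list:
    """e'_{j,k} = e_{j-1,2k} - e_{j,k}/2 for j >= 1."""
    left_half = indicator(j - 1, 2 * k, nu)
    whole = indicator(j, k, nu)
    return [x - y / 2 for x, y in zip(left_half, whole)]


def inner(a: list, b: list) -> Fraction:
    """Usual inner product on R^nu."""
    return sum(x * y for x, y in zip(a, b))


def basis(N: int) -> dict:
    """The family E for nu = 2^N, keyed by ('e', N, 0) or ('e_prime', j, k)."""
    nu = 2 ** N
    vecs = {("e", N, 0): indicator(N, 0, nu)}
    for j in range(1, N + 1):
        for k in range(2 ** (N - j)):
            vecs[("e_prime", j, k)] = e_prime(j, k, nu)
    return vecs
```

For example, `basis(4)` has 16 members, each pair has inner product 0, and the squared norms are $\nu = 16$ for $e_{N,0}$ and $2^{j-2}$ for $e'_{j,k}$.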


Recalling the notation of Lemma 1.21, let $\Phi_n$ be the distribution function of a binomial $B(n, 1/2)$ random variable, with quantile function $\Phi_n^{\leftarrow}$. Now let $G_m(t) := \Phi_m^{\leftarrow}(\Phi(t))$.

The next theorem will give a way of linking up or “coupling” processes. 
Recall that a Polish space is a topological space metrizable by a complete 
separable metric. 


Theorem 1.31 (Vorob'ev-Berkes-Philipp) Let $X$, $Y$, and $Z$ be Polish spaces with Borel $\sigma$-algebras. Let $\alpha$ be a law on $X \times Y$ and let $\beta$ be a law on $Y \times Z$. Let $\pi_Y(x, y) := y$ and $\tau_Y(y, z) := y$ for all $(x, y, z) \in X \times Y \times Z$. Suppose the marginal distributions of $\alpha$ and $\beta$ on $Y$ are equal, in other words $\eta := \alpha \circ \pi_Y^{-1} = \beta \circ \tau_Y^{-1}$ on $Y$. Let $\pi_{12}(x, y, z) := (x, y)$ and $\pi_{23}(x, y, z) := (y, z)$. Then there exists a law $\gamma$ on $X \times Y \times Z$ such that $\gamma \circ \pi_{12}^{-1} = \alpha$ and $\gamma \circ \pi_{23}^{-1} = \beta$.


Proof. There exist conditional distributions $\alpha_y$ for $\alpha$ on $X$ given $y \in Y$, so that for each $y \in Y$, $\alpha_y$ is a probability measure on $X$, for any Borel set $A \subset X$, the function $y \mapsto \alpha_y(A)$ is measurable, and for any integrable function $f$ for $\alpha$,

$$\int f\,d\alpha = \int\!\!\int f(x, y)\,d\alpha_y(x)\,d\eta(y)$$

(RAP, Section 10.2). Likewise, there exist conditional distributions $\beta_y$ on $Z$ for $\beta$. Let $x$ and $z$ be conditionally independent given $y$. In other words, define a set function $\gamma$ on $X \times Y \times Z$ by

$$\gamma(C) = \int\!\!\int\!\!\int 1_C(x, y, z)\,d\alpha_y(x)\,d\beta_y(z)\,d\eta(y).$$

The integral is well-defined if

(a) $C = U \times V \times W$ for Borel sets $U$, $V$, and $W$ in $X$, $Y$, and $Z$ respectively,
(b) $C$ is a finite union of such sets, which can be taken to be disjoint (RAP, Proposition 3.2.2 twice), or
(c) $C$ is any Borel set in $X \times Y \times Z$, by RAP, Proposition 3.2.3, and the monotone class theorem (RAP, Theorem 4.4.2).

Also, $\gamma$ is countably additive by monotone convergence (for all three integrals). So $\gamma$ is a law on $X \times Y \times Z$. Clearly $\gamma \circ \pi_{12}^{-1} = \alpha$ and $\gamma \circ \pi_{23}^{-1} = \beta$.
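For finite spaces the construction in this proof reduces to an explicit formula, $\gamma(x, y, z) = \alpha(x, y)\,\beta(y, z)/\eta(y)$, which makes $x$ and $z$ conditionally independent given $y$. The following sketch illustrates only that discrete special case; the function names and the dictionary representation of laws are assumptions made for the example.

```python
from collections import defaultdict
from fractions import Fraction


def marginal(law: dict, idx: tuple) -> dict:
    """Marginal of a finite law (dict: tuple -> probability) on coordinates idx."""
    out = defaultdict(Fraction)
    for key, p in law.items():
        out[tuple(key[i] for i in idx)] += p
    return dict(out)


def glue(alpha: dict, beta: dict) -> dict:
    """Law gamma on X x Y x Z with (x, y)-marginal alpha and (y, z)-marginal beta.

    alpha maps (x, y) to a probability, beta maps (y, z) to a probability, and
    both must have the same marginal eta on Y.
    """
    eta = marginal(alpha, (1,))
    if eta != marginal(beta, (0,)):
        raise ValueError("alpha and beta must have equal marginals on Y")
    gamma = {}
    for (x, y), p in alpha.items():
        for (y2, z), r in beta.items():
            if y2 == y and eta[(y,)] > 0:
                # Conditional independence of x and z given y.
                gamma[(x, y, z)] = p * r / eta[(y,)]
    return gamma
```

A usage example: with `alpha = {(0, 'a'): Fraction(1, 4), (1, 'a'): Fraction(1, 4), (0, 'b'): Fraction(1, 2)}` and `beta = {('a', 'u'): Fraction(1, 2), ('b', 'u'): Fraction(1, 6), ('b', 'v'): Fraction(1, 3)}`, the glued law recovers both inputs as marginals.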


Now to begin the construction that will connect the empirical process with a Brownian bridge, let

$$U_{N,0} := n \tag{1.73}$$

and then recursively as $j$ decreases from $j = N$ to $j = 1$,

$$U_{j-1,2k} := G_{U_{j,k}}\big((2^{2-j}/\zeta)^{1/2}\,W'_{j,k}\big), \qquad U_{j-1,2k+1} := U_{j,k} - U_{j-1,2k}, \tag{1.74}$$

$k = 0, 1, \ldots, 2^{N-j} - 1$. Note that by (1.72), $(2^{2-j}/\zeta)^{1/2}\,W'_{j,k}$ has a standard normal distribution, so $\Phi$ of it has a $U[0, 1]$ distribution. It is easy to verify successively for $j = N, N - 1, \ldots, 0$ that the random vector $\{U_{j,k},\ 0 \le k < 2^{N-j}\}$ has a multinomial distribution with parameters $n, 2^{j-N}, \ldots, 2^{j-N}$. Let $X := (U_{0,0}, U_{0,1}, \ldots, U_{0,\nu-1})$. Then the random vector $X$ has a multinomial distribution with parameters $n, 1/\nu, \ldots, 1/\nu$.
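The recursion (1.73)-(1.74) can be mimicked in a few lines. The sketch below is illustrative only: it draws independent standard normal variables directly in place of the standardized $W'_{j,k}$, takes $G_m$ to be the binomial$(m, 1/2)$ quantile function applied to $\Phi$ of its argument, and uses hypothetical function names.

```python
import math
import random
from statistics import NormalDist


def binom_half_quantile(m: int, p: float) -> int:
    """Smallest k with P(Bin(m, 1/2) <= k) >= p."""
    total = 2 ** m
    cum = 0
    for k in range(m + 1):
        cum += math.comb(m, k)
        if cum >= p * total:
            return k
    return m


def dyadic_counts(n: int, N: int, rng: random.Random) -> list:
    """Split n into 2^N cell counts by repeated binomial halving.

    Starts from U_{N,0} = n and, level by level, splits each count u into
    (G_u(xi), u - G_u(xi)) with xi standard normal, as in (1.74).
    """
    phi = NormalDist().cdf
    level = [n]
    for _ in range(N):
        nxt = []
        for u in level:
            xi = rng.gauss(0.0, 1.0)  # stands in for the standardized W'_{j,k}
            left = binom_half_quantile(u, phi(xi))
            nxt.extend([left, u - left])
        level = nxt
    return level
```

For instance `dyadic_counts(300, 8, random.Random(0))` returns 256 nonnegative integers summing to 300. The point of the construction in the text is that the same normal variables also drive the Gaussian vector $Z$, which is what couples the two processes.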
The random vector $X$ is equal in distribution to

$$\{n(F_n((k + 1)/\nu) - F_n(k/\nu)),\ 0 \le k \le \nu - 1\}, \tag{1.75}$$

while for a Wiener process $W$, $Z$ is equal in distribution to

$$\{\sqrt{n}\,(W((k + 1)/\nu) - W(k/\nu)),\ 0 \le k \le \nu - 1\}. \tag{1.76}$$

Without loss of generality, we can assume that the above equalities in distribution are actual equalities for some uniform empirical distribution functions $F_n$ and Wiener process $W = W_n$. Specifically, consider a vector of i.i.d. uniform random variables $(x_1, \ldots, x_n) \in \mathbb{R}^n$ such that

$$F_n(t) := \frac{1}{n}\sum_{j=1}^{n} 1_{\{x_j \le t\}}$$

and note that $W$ has sample paths in $C[0, 1]$. Both $\mathbb{R}^n$ and $C[0, 1]$ are separable Banach spaces. Thus one can let $(x_1, \ldots, x_n)$ and $W$ be conditionally independent given the vectors in (1.75) and (1.76) which have the joint distribution of $X$ and $Z$, by the Vorob'ev-Berkes-Philipp Theorem 1.31. Then we define a Brownian bridge by $Y_n(t) := W_n(t) - tW_n(1)$ and the empirical process $\alpha_n(t) := \sqrt{n}\,(F_n(t) - t)$, $0 \le t \le 1$. By our choices, we then have

$$\{\sqrt{n}\,\alpha_n(j/\nu)\}_{j=0}^{\nu} = \Big\{\sum_{i=0}^{j-1}\Big(X_i - \frac{n}{\nu}\Big)\Big\}_{j=0}^{\nu} \tag{1.77}$$

and

$$\{\sqrt{n}\,Y_n(j/\nu)\}_{j=0}^{\nu} = \Big\{\sum_{i=0}^{j-1} Z_i - \frac{j}{\nu}\sum_{r=0}^{\nu-1} Z_r\Big\}_{j=0}^{\nu}. \tag{1.78}$$

Theorem 1.8 will be proved for the given $Y_n$ and $\alpha_n$. Specifically, we want to prove

$$P_0 := P\Big(\sup_{0\le t\le 1} |\alpha_n(t) - Y_n(t)| > (x + 12\log n)/\sqrt{n}\Big) \le 2\exp(-x/6). \tag{1.79}$$

It will be shown that $\alpha_n(j/\nu)$ and $Y_n(j/\nu)$ are not too far apart for $j = 0, 1, \ldots, \nu$, while the increments of the processes over the intervals between the lattice points $j/\nu$ are also not too large.

Let $C := 0.29$. Let $M$ be the least integer such that

$$C(x + 6\log n) \le \zeta 2^{M+1}. \tag{1.80}$$

Since $n \ge 204$ (1.70) and $\zeta < 2$ this implies $M \ge 2$. We have by definition of $M$ and (1.69)

$$2^M \le \zeta 2^M < C(x + 6\log n) \le Cn/3 < 0.1 \cdot 2^{N+1} < 2^{N-2},$$

so $M \le N - 3$.

For each $t \in [0, 1]$, let $\pi_M(t)$ be the nearest point of the grid $\{i2^{M-N} : 0 \le i \le 2^{N-M}\}$, or if there are two nearest points, take the smaller one. Let $D := X - Z$ and $D(m) := \sum_{i=0}^{m-1} D_i$. Let $C' := 0.855$ and define

$$\Theta := \{U_{j,k} \le \zeta(1 + C')2^j \text{ whenever } M + 1 < j \le N,\ 0 \le k < 2^{N-j}\} \cap \{U_{j,k} \ge \zeta(1 - C')2^j \text{ whenever } M < j \le N,\ 0 \le k < 2^{N-j}\}.$$

Then

$$P_0 \le P_1 + P_2 + P_3 + P(\Theta^c)$$

where

$$P_1 := P\Big(\sup_{0\le t\le 1} |\alpha_n(t) - \alpha_n(\pi_M(t))| > 0.28(x + 6\log n)/\sqrt{n}\Big), \tag{1.81}$$

$$P_2 := P\Big(\sup_{0\le t\le 1} |Y_n(t) - Y_n(\pi_M(t))| > 0.22(x + 6\log n)/\sqrt{n}\Big), \tag{1.82}$$

and, recalling (1.77) and (1.78),

$$P_3 := 2^{N-M}\max_{m\in A(M)} P\Big[\Big\{\Big|D(m) - \frac{m}{\nu}D(\nu)\Big| > 0.5x + 9\log n\Big\} \cap \Theta\Big], \tag{1.83}$$

where $A(M) := \{k2^M : k = 1, 2, \ldots\} \cap A(0, \nu)$.
First we bound $P(\Theta^c)$. Since by (1.74) $U_{j,k} = U_{j-1,2k} + U_{j-1,2k+1}$, we have

$$\Theta^c \subset \bigcup_{0\le k<2^{N-M-2}} \{U_{M+2,k} > (1 + C')\zeta 2^{M+2}\} \ \cup \bigcup_{0\le k<2^{N-M-1}} \{U_{M+1,k} < (1 - C')\zeta 2^{M+1}\}.$$

Since $U_{M+2,k}$ and $U_{M+1,k}$ are binomial random variables, Lemma 1.28 gives

$$P(\Theta^c) \le 2^{N-M-1}\big(\exp(-\zeta 2^{M+2}h(C')) + \exp(-\zeta 2^{M+1}h(-C'))\big).$$

Now $2h(C') \ge 0.5823 \ge h(-C') \ge 0.575$ (note that $C'$ has been chosen to make $2h(C')$ and $h(-C')$ approximately equal). By definition of $M$ (1.80), $\zeta 2^{M+1} \ge C(x + 6\log n)$, and $0.575C > 1/6$, so

$$P(\Theta^c) \le 2^{-M}\exp(-x/6). \tag{1.84}$$


Next, to bound $P_1$ and $P_2$. Let $b := 2^{M-N-1} \le 1/2$. Since $\alpha_n(t)$ has stationary increments, we can apply Lemma 1.29. Let $u := x + 6\log n$. We have by definition of $M$ (1.80)

$$nb = n2^{M-N-1} < Cu/2. \tag{1.85}$$

By (1.69), $u < n/3$ so $b < C/6$. Recalling (1.59), note that $h'(t) = \log(1 + t)$. Thus $h$ is increasing. For any given $v > 0$ it is easy to check that

$$y \mapsto yh(v/y) \text{ is decreasing for } y > 0. \tag{1.86}$$

Lemma 1.29 gives

$$P_1 \le 2^{N-M+2}\exp\Big(-nb(1 - b)\,h\Big(\frac{0.28u}{nb}\Big)\Big) < 2^{N-M+2}\exp\Big(-\frac{C}{2}\Big(1 - \frac{C}{6}\Big)u\,h\Big(0.28 \cdot \frac{2}{C}\Big)\Big)$$

by (1.86) and (1.85) and since $1 - b > 1 - C/6$, so one can calculate

$$P_1 \le 2^{N-M+2}e^{-u/6} \le 2^{2-M}\exp(-x/6). \tag{1.87}$$

The Brownian bridge also has stationary increments, so Lemma 1.30, (1.64), and (1.85) give

$$P_2 \le 2^{N-M+2}\exp\big(-(0.22u)^2/(2nb)\big) \le 2^{N-M+2}\exp\big(-(0.22)^2u/C\big) \le 2^{2-M}e^{-x/6} \tag{1.88}$$

since $(0.22)^2/C > 1/6$.
It remains to bound $P_3$. Fix $m \in A(M)$. A bound is needed for

$$P_3(m) := P\Big[\Big\{\Big|D(m) - \frac{m}{\nu}D(\nu)\Big| > 0.5x + 9\log n\Big\} \cap \Theta\Big]. \tag{1.89}$$

For each $j = 1, \ldots, N$ take $k(j)$ such that $m \in I_{j,k(j)}$. By the definition (1.71) of $I_{j,k}$, $k(M) = m2^{-M} - 1$ and $k(j) = [k(j - 1)/2]$ for $j = 1, \ldots, N$ where $[x]$ is the largest integer $\le x$. From here on each double subscript $j, k(j)$ will be abbreviated to the single subscript $j$, e.g. $e'_j := e'_{j,k(j)}$. The following orthogonal expansion holds in $\mathcal{E}$:

$$1_{A(0,m)} = \frac{m}{\nu}\,e_{N,0} + \sum_{M<j\le N} c_j e'_j, \tag{1.90}$$

where $0 \le c_j \le 1$ for $M < j \le N$. To see this, note that $1_{A(0,m)} \perp e'_{j,k}$ for $j \le M$ since $2^M$ is a divisor of $m$. Also, $1_{A(0,m)} \perp e'_{j,k}$ for $k \ne k(j)$ since $1_{A(0,m)}$ has all 0's or all 1's on the set where $e'_{j,k}$ has nonzero entries, half of which are $+1/2$ and the other half $-1/2$. In an orthogonal expansion $f = \sum_j c_j f_j$ we always have $c_j = (f, f_j)/\|f_j\|^2$ where $\|v\|^2 := (v, v)$. We have $\|e'_j\| = 2^{(j-2)/2}$. Now, $(1_{A(0,m)}, e'_j)$ is as large as possible when the components of $e'_j$ equal $1/2$ only for indices $\le m$, and then the inner product equals $2^{j-2}$, so $|c_j| \le 1$ as stated. The $m/\nu$ factor is clear.
We next have

$$e_j = 2^{j-N}e_{N,0} + \sum_{i>j} (-1)^{s(i,j,m)}\,2^{j+1-i}\,e'_i, \tag{1.91}$$

where $s(i, j, m) = 0$ or 1 for each $i, j, m$ so that the corresponding factors are $\pm 1$, the signs being immaterial in what follows. Let $\Delta_j := (D, e'_j)$. Then from (1.90),

$$\Big|D(m) - \frac{m}{\nu}D(\nu)\Big| \le \sum_{M<j\le N} |\Delta_j|. \tag{1.92}$$

Recall that $W'_j = (Z, e'_j)$ (see between (1.71) and (1.72)) and $D = X - Z$. Let $\xi_j := (2^{2-j}/\zeta)^{1/2}\,W'_j$ for $M < j \le N$. Then by (1.72), and the preceding statement, $\xi_{M+1}, \ldots, \xi_N$ are i.i.d. standard normal random variables. We have $U_{j,k} = (X, e_{j,k})$ for all $j$ and $k$ from the definitions. Then $U_j = (X, e_j)$. Let $U'_j := (X, e'_j)$. By (1.74) and Lemma 1.21, (1.9),

$$|U'_j - \sqrt{U_j}\,\xi_j/2| \le 1 + \xi_j^2/8. \tag{1.93}$$

Let

$$L_j := |W'_j - \sqrt{U_j}\,\xi_j/2| = |\xi_j|\,|\sqrt{U_j} - \sqrt{\zeta 2^j}|/2$$

by definition of $\xi_j$. Thus

$$|\Delta_j| \le L_j + 1 + \xi_j^2/8. \tag{1.94}$$
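Then we have on $\Theta$

$$|\sqrt{U_j} - \sqrt{\zeta 2^j}| = \frac{|U_j - \zeta 2^j|}{\sqrt{\zeta 2^j} + \sqrt{U_j}} \le \frac{|U_j - \zeta 2^j|}{\sqrt{\zeta 2^j}} \cdot \frac{1}{1 + \sqrt{1 - C'}}$$

where as before $C' := 0.855$. Then by (1.74), (1.91), and (1.8) of Lemma 1.21,

$$|U_j - \zeta 2^j| \le 2^{j-N}|U_N - n| + 2\sum_{j<i\le N} 2^{j-i}|U'_i| \le 2 + (\zeta(1 + C'))^{1/2}\sum_{j<i\le N} 2^{j-i/2}|\xi_i|$$

on $\Theta$, recalling that by (1.73), $U_N = U_{N,0} = n$. Let $C_2 := 1/(1 + \sqrt{1 - C'})$. It follows that

$$L_j \le C_2 2^{-j/2}|\xi_j| + \tfrac{1}{2}C_2\sqrt{1 + C'}\sum_{j<i\le N} 2^{(j-i)/2}|\xi_i||\xi_j|. \tag{1.95}$$

Applying the inequality $|\xi_i||\xi_j| \le (\xi_i^2 + \xi_j^2)/2$, we get the bound

$$\sum_{M<j\le N}\ \sum_{j<i\le N} 2^{(j-i)/2}|\xi_i||\xi_j| \le \sum_{M<j\le N} A_j\xi_j^2 \tag{1.96}$$

where

$$A_j := \frac{1}{2}\Big[\sum_{M<r<j} 2^{(r-j)/2} + \sum_{j<i\le N} 2^{(j-i)/2}\Big].$$

Then

$$A_j \le \frac{1}{2}\Big[\frac{2^{-1/2} - 2^{(M-j)/2}}{1 - 2^{-1/2}} + \frac{2^{-1/2}}{1 - 2^{-1/2}}\Big] = 1 + \sqrt{2} - \frac{2^{(M-j-2)/2}}{1 - 2^{-1/2}}.$$

Let $C_3 := C_2(1 + \sqrt{2})\sqrt{1 + C'}/2 \le 1.19067$. Then

$$\sum_{M<j\le N} L_j \le C_3\sum_{M<j\le N}\xi_j^2 + \sum_{M<j\le N} 2^{-j/2}C_2\Big[|\xi_j| - \frac{\sqrt{1 + C'}\,2^{M/2}\,\xi_j^2}{4(1 - 2^{-1/2})}\Big]. \tag{1.97}$$

Let

$$C_4 := \frac{\sqrt{1 + C'}}{4(1 - 2^{-1/2})} = \frac{\sqrt{2}\sqrt{1 + C'}\,(\sqrt{2} + 1)}{4}$$

and for each $M$ let $c_M := 1/(4C_4 2^{M/2})$. Then for any real number $x$, we have $x(1 - C_4 2^{M/2}x) \le c_M$. It follows that

$$\sum_{M<j\le N} L_j \le \sum_{M<j\le N} C_2 2^{-j/2}c_M + C_3\sum_{M<j\le N}\xi_j^2 \le \frac{C_2 c_M 2^{-(M+1)/2}}{1 - 2^{-1/2}} + C_3\sum_{M<j\le N}\xi_j^2 = \frac{C_2 2^{-M}}{\sqrt{2}\sqrt{1 + C'}} + C_3\sum_{M<j\le N}\xi_j^2.$$

Thus, combining (1.94) and (1.97) we get on $\Theta$

$$\sum_{M<j\le N} |\Delta_j| \le N + \Big(\frac{1}{8} + C_3\Big)\sum_{M<j\le N}\xi_j^2. \tag{1.98}$$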

We have $E\exp(t\xi^2) = (1 - 2t)^{-1/2}$ for $t < 1/2$ and any standard normal variable $\xi$ such as $\xi_j$ for each $j$. Since $\xi_{M+1}, \ldots, \xi_N$ are independent we get

$$E\Big(\exp\Big(\frac{1}{3}\sum_{M<j\le N} |\Delta_j|\Big)1_\Theta\Big) \le e^{N/3}\Big(1 - \frac{2}{3}\Big(C_3 + \frac{1}{8}\Big)\Big)^{-(N-M)/2} \le e^{N/3}\,2^{1.513(N-M)} \le 2^{2N-1.5M}.$$

Markov's inequality and (1.92) then yield

$$P_3(m) \le e^{-x/6}n^{-3}2^{2N-1.5M}.$$

Thus

$$P_3 \le e^{-x/6}n^{-3}2^{3N-2.5M} \le 2^{-2.5M}e^{-x/6}. \tag{1.99}$$


Collecting (1.84), (1.87), (1.88) and (1.99) we get that $P_0 \le (2^{3-M} + 2^{-M} + 2^{-2.5M})e^{-x/6}$. By (1.80) and (1.70) and since $x \ge 6\log 2$ (1.69) and $M \ge 2$, it follows that Theorem 1.8 holds.
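Several numerical constants enter the bounds for $P(\Theta^c)$, $P_1$, $P_2$ and $P_3$. The sketch below recomputes them, assuming $h(t) = (1+t)\log(1+t) - t$ as in (1.59), $C = 0.29$, $C' = 0.855$, $C_2 = 1/(1+\sqrt{1-C'})$ and $C_3 = C_2(1+\sqrt{2})\sqrt{1+C'}/2$; the function name and dictionary keys are hypothetical labels, not notation from the text.

```python
import math


def h(t: float) -> float:
    """h(t) = (1 + t) log(1 + t) - t for t > -1 (assumed form of (1.59))."""
    return (1.0 + t) * math.log1p(t) - t


def proof_constant_checks() -> dict:
    """Boolean checks of the numerical inequalities used in Section 1.4.5."""
    C, C1 = 0.29, 0.855
    C2 = 1.0 / (1.0 + math.sqrt(1.0 - C1))
    C3 = C2 * (1.0 + math.sqrt(2.0)) * math.sqrt(1.0 + C1) / 2.0
    mgf_factor = (1.0 - (2.0 / 3.0) * (C3 + 1.0 / 8.0)) ** -0.5
    return {
        "2h(C') >= 0.5823": 2.0 * h(C1) >= 0.5823,
        "0.5823 >= h(-C') >= 0.575": 0.5823 >= h(-C1) >= 0.575,
        "0.575 C > 1/6": 0.575 * C > 1.0 / 6.0,
        "P1 exponent > 1/6": (C / 2.0) * (1.0 - C / 6.0) * h(0.56 / C) > 1.0 / 6.0,
        "P2 exponent > 1/6": 0.22 ** 2 / C > 1.0 / 6.0,
        "C3 <= 1.19067": C3 <= 1.19067,
        "mgf factor <= 2^1.513": mgf_factor <= 2.0 ** 1.513,
        "e^(1/3) 2^1.513 <= 4": math.exp(1.0 / 3.0) * 2.0 ** 1.513 <= 4.0,
    }
```

Every entry of `proof_constant_checks()` evaluates to `True`; several of the margins are small, which reflects how tightly the constants 0.29, 0.855, 0.28 and 0.22 were chosen.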


1.5 The DKW Inequality in Massart’s Form 


A. Dvoretzky, J. Kiefer, and J. Wolfowitz (1956) proved the "Dvoretzky-Kiefer-Wolfowitz" (DKW) inequality, namely that there is a constant $D < +\infty$ such that for any distribution function $F$ on $\mathbb{R}$ and its empirical distribution functions $F_n$, we have for every $u > 0$,

$$\Pr\Big(\sqrt{n}\,\sup_x |(F_n - F)(x)| > u\Big) \le D\exp(-2u^2). \tag{1.100}$$

For the case that $F$ is the $U[0, 1]$ distribution function $U$, this is used in the proof of the Bretagnolle-Massart theorem (1.68) with $D = 6$.
Massart (1990) proved the following: 


Theorem 1.32 (Massart) The inequality (1.100) holds with the constant D = 
2. 


Remark. The constant $D = 2$ is best possible because, for a Brownian bridge $y$ and $F$ continuous, specifically $F = U$, we have as shown by Kolmogorov (1933) and as also follows from the Komlós-Major-Tusnády approximation, Theorem 1.8,

$$\lim_{n\to\infty}\Pr\Big(\sqrt{n}\,\sup_x |(F_n - F)(x)| > u\Big) = \Pr\Big(\sup_t |y_t| > u\Big) = 2\sum_{j=1}^{\infty} (-1)^{j-1}\exp(-2j^2u^2),$$

where the last equality is Theorem 1.6 (or Proposition 12.3.4 of RAP). As $u$ becomes large, the latter expression is asymptotic to $2\exp(-2u^2)$.
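The asymptotic statement in this Remark is easy to see numerically. The following sketch sums the alternating series for the limiting tail and compares it with $2\exp(-2u^2)$; the function name and the truncation at 100 terms are choices made only for the example.

```python
import math


def kolmogorov_tail(u: float, terms: int = 100) -> float:
    """2 * sum_{j>=1} (-1)^(j-1) exp(-2 j^2 u^2), truncated after `terms` terms."""
    return 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * u * u)
                     for j in range(1, terms + 1))


def ratio_to_leading_term(u: float) -> float:
    """Ratio of the series to 2 exp(-2 u^2); tends to 1 as u grows."""
    return kolmogorov_tail(u) / (2.0 * math.exp(-2.0 * u * u))
```

Already at $u = 2$ the ratio differs from 1 by roughly $e^{-24}$, since the second term of the series is smaller than the first by the factor $\exp(-6u^2)$.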
Proof. It will suffice to prove Theorem 1.32 for the $U[0, 1]$ distribution $U$ by Proposition 1.2. In fact, if $F$ is continuous, then its range includes the open interval $(0, 1)$. Whether it contains 0 or 1 is immaterial since $(U_n - U)(0) = 0 - 0 = 0$ and $(U_n - U)(1) = 1 - 1 = 0$ with probability 1. Thus, the distribution of $\sup_x |(F_n - F)(x)|$ is the same for all continuous $F$. For a general $F$, possibly discontinuous, its range is included in $[0, 1]$, so $\sup_x |(U_n - U)(F(x))| \le \sup_{0\le t\le 1} |(U_n - U)(t)|$. So Theorem 1.32 for $U[0, 1]$ implies it for arbitrary $F$.

Recalling $\alpha_n(t) := \sqrt{n}\,(U_n - U)(t)$ for $0 \le t \le 1$, let $D_n^+ := \sup_t \alpha_n(t)$, $D_n^- := \sup_t(-\alpha_n(t))$, and $D_n := \sup_t |\alpha_n(t)| = \max(D_n^+, D_n^-)$. We have the following symmetry:

Proposition 1.33 For any $n = 1, 2, \ldots$, $D_n^+$ and $D_n^-$ are equal in distribution.

Proof. Let $X_1, \ldots, X_n$ be the i.i.d. $U[0, 1]$ variables on which $U_n$ is based. Let $Y_j := 1 - X_j$ for $j = 1, \ldots, n$. Then $Y_1, \ldots, Y_n$ are i.i.d. $U[0, 1]$. For a function $F$ and real $x$ define $F(x-) := \lim_{y\uparrow x} F(y)$ if the limit (left limit) exists, as it always will for a distribution function $F$. It will be different from $F(x)$ if $F$ has a jump at $x$, specifically, for an empirical distribution $F_n$ and $x$ equal to one of the $X_j$. Let

$$G_n(t) := \frac{1}{n}\sum_{j=1}^{n} 1_{\{Y_j \le t\}} = \frac{1}{n}\sum_{j=1}^{n} 1_{\{X_j \ge 1-t\}} = 1 - F_n((1 - t)-)$$

for $0 \le t \le 1$. Thus almost surely

$$\sup_{0\le t\le 1} (G_n(t) - t) = \sup_{0\le t\le 1} [1 - t - F_n((1 - t)-)] = \sup_{0\le s\le 1} (s - F_n(s-)) = \sup_{0\le s\le 1} (s - F_n(s)),$$

which gives the conclusion.


Massart (1990, Theorem 1) gives the following fact, which is interesting in 
itself and implies Theorem 1.32 (see the Remarks after it): 


Theorem 1.34 For any $n = 1, 2, \ldots$ and any $\lambda \ge \min\big(\sqrt{(\log 2)/2},\ \zeta n^{-1/6}\big)$, where $\zeta := 1.0841$, we have

$$\Pr(D_n^- \ge \lambda) \le \exp(-2\lambda^2). \tag{1.101}$$

Remarks. If $\exp(-2\lambda^2) \le 1/2$, then $\lambda \ge \sqrt{(\log 2)/2}$, which implies the hypothesis of Theorem 1.34. Also, Proposition 1.33 implies $\Pr(D_n \ge \lambda) \le 2\Pr(D_n^- \ge \lambda)$. Further, Theorem 1.32 holds trivially if $2\exp(-2\lambda^2) \ge 1$, so it suffices to prove Theorem 1.34 to prove Theorem 1.32 in all cases.


Proof. If for a given $\lambda > 0$,

$$D_n^- = \sqrt{n}\,\sup_{0\le t\le 1} (t - U_n(t)) > \lambda, \tag{1.102}$$

then $t - U_n(t) = \lambda/\sqrt{n}$ for some $t$, because between its downward jumps at the observations $X_j$, $t - U_n(t)$ is continuously increasing. Let $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ (almost surely) be the order statistics of $X_1, \ldots, X_n$. Thus

$$\sup_{0\le t\le 1} (t - U_n(t)) = \max_{1\le k\le n}\big(X_{(k)} - (k - 1)/n\big)$$

(the supremum occurs just to the left of some $X_{(k)}$); the supremum is strictly positive with probability 1 because $X_{(1)} > 0$. So if (1.102) holds there is a smallest $k = 1, \ldots, n$ with $X_{(k)} - (k - 1)/n > \lambda/\sqrt{n}$. Letting $X_{(0)} := 0$, we must have $t - U_n(t) = \lambda/\sqrt{n}$ for some $t$ with $X_{(k-1)} \le t < X_{(k)}$. Let $\tau_n := \tau_n(\lambda)$ be the least $t \in [0, 1]$ such that $t - U_n(t) = \lambda/\sqrt{n}$ if one exists ($D_n^- \ge \lambda$), otherwise let $\tau_n = 2$. If $\tau_n < 2$, then for some $j = 0, 1, \ldots, n - 1$, $\tau_n - j/n = \lambda/\sqrt{n}$, i.e. $\tau_n = \frac{j}{n} + \frac{\lambda}{\sqrt{n}}$, which implies that

$$j \le n - \lambda\sqrt{n}. \tag{1.103}$$

If $\lambda > \sqrt{n}$ then $\Pr(D_n^- \ge \lambda) = 0$, implying the conclusion of the theorem, so suppose $\lambda \le \sqrt{n}$. Let $J \ge 0$ be the largest integer less than $n - \lambda\sqrt{n}$.
The following fact, according to Massart (1990), is due to Smirnov (1944). 


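Proposition 1.35 (Smirnov) For each $\lambda$ with $0 < \lambda < \sqrt{n}$ and $\varepsilon := \lambda/\sqrt{n}$, and each $j = 1, \ldots, J$, $\Pr(\tau_n = \varepsilon + j/n) = p_{\lambda,n}(j)$ where

$$p_{\lambda,n}(j) := \binom{n}{j}\,\varepsilon\Big(\varepsilon + \frac{j}{n}\Big)^{j-1}\Big(1 - \varepsilon - \frac{j}{n}\Big)^{n-j}. \tag{1.104}$$

For $j = 0$, $\Pr(\tau_n = \varepsilon) = p_{\lambda,n}(0) := (1 - \varepsilon)^n$.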
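Summing the probabilities of Proposition 1.35 over $j$ gives the exact tail $\Pr(D_n^- \ge \lambda)$, which can then be compared with $\exp(-2\lambda^2)$. The sketch below assumes the form $p_{\lambda,n}(j) = \binom{n}{j}\varepsilon(\varepsilon + j/n)^{j-1}(1 - \varepsilon - j/n)^{n-j}$ with $\varepsilon = \lambda/\sqrt{n}$; the function names are hypothetical.

```python
import math


def smirnov_terms(n: int, lam: float) -> list:
    """Terms p_{lam,n}(j) for all j >= 0 with eps + j/n < 1, where eps = lam/sqrt(n)."""
    eps = lam / math.sqrt(n)
    terms = []
    j = 0
    while j < n:
        x = eps + j / n
        if x >= 1.0:
            break
        if j == 0:
            p = (1.0 - eps) ** n
        else:
            p = math.comb(n, j) * eps * x ** (j - 1) * (1.0 - x) ** (n - j)
        terms.append(p)
        j += 1
    return terms


def one_sided_tail(n: int, lam: float) -> float:
    """Pr(D_n^- >= lam) as the sum of the Smirnov terms."""
    return sum(smirnov_terms(n, lam))
```

Two hand-checkable cases: for $n = 1$ the tail is $1 - \lambda$, and for $n = 2$ with $\varepsilon = 0.3$ it is $1 - \varepsilon - \varepsilon^2 = 0.61$. For $\lambda \ge \sqrt{(\log 2)/2}$ the computed tail stays below $\exp(-2\lambda^2)$, in line with Theorem 1.34.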


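Proof. For each $n$, $\varepsilon = \lambda/\sqrt{n}$ and $i = 1, \ldots, J + 1$ let $A_i := \{X_{(i)} \le \varepsilon + \frac{i-1}{n}\}$. Here is a

Claim. We have $\{\tau_n = \varepsilon\} = A_1^c$ and for $j = 1, \ldots, J$, $\{\tau_n = \varepsilon + \frac{j}{n}\} = \big(\bigcap_{i=1}^{j} A_i\big) \cap A_{j+1}^c$.

Proof of Claim. On $A_1^c$, $X_{(1)} > \varepsilon$, which is equivalent to $(U - U_n)(\varepsilon) = \varepsilon$ and so to $\tau_n = \varepsilon$. This event has probability $(1 - \varepsilon)^n$.

On $A_i$ for $i \ge 1$, $U_n\big(\frac{i-1}{n} + \varepsilon\big) \ge i/n$ and so $(U - U_n)\big(\frac{i-1}{n} + \varepsilon\big) \le \varepsilon - 1/n < \varepsilon$ and $\tau_n \ne \frac{i-1}{n} + \varepsilon$. Thus on $B_j := \bigcap_{1\le i\le j} A_i$, $\tau_n \ge \frac{j}{n} + \varepsilon$.

On $A_{j+1}^c$, $X_{(j+1)} > \frac{j}{n} + \varepsilon$, and if $A_j$ also occurs, $X_{(j)} \le \frac{j-1}{n} + \varepsilon$, so $U_n(\varepsilon + j/n) = j/n$ and $(U - U_n)\big(\frac{j}{n} + \varepsilon\big) = \varepsilon$. Thus on $B_j \cap A_{j+1}^c$, $\tau_n = \varepsilon + j/n$. It follows that $\tau_n = \frac{j}{n} + \varepsilon$ for the unique $j \le J$ such that $\omega \in B_j \cap A_{j+1}^c$ (these sets are disjoint), where $B_0 := \Omega$, if such a $j$ exists, otherwise $\tau_n = 2$. This proves the Claim.

Now continuing the proof of Proposition 1.35, $X_{(1)}$ has distribution function $1 - (1 - x_1)^n$ and so density $n(1 - x_1)^{n-1}$ for $0 \le x_1 \le 1$. For $1 \le i < n$, conditional on $X_{(1)}, \ldots, X_{(i)}$, $X_{(i+1)}$ is the least of $n - i$ variables i.i.d. $U[X_{(i)}, 1]$ (this conditional distribution only depends on $X_{(i)}$). Thus $\Pr(X_{(i+1)} \ge t \mid X_{(i)}) = ((1 - t)/(1 - X_{(i)}))^{n-i}$, and the conditional density of $X_{(i+1)}$ given $X_{(i)} = x_i$ is $(n - i)(1 - x_{i+1})^{n-i-1}/(1 - x_i)^{n-i}$. Iterating, the joint density of $X_{(1)}, \ldots, X_{(j+1)}$ is $n!\,(1 - x_{j+1})^{n-j-1}/(n - j - 1)!$ for $0 \le x_1 \le \cdots \le x_j \le x_{j+1} \le 1$ and 0 elsewhere. Thus

$$\Pr\Big(\tau_n = \varepsilon + \frac{j}{n}\Big) = \frac{n!}{(n - j - 1)!}\,I_j J_j$$

where

$$I_j := \int_0^{\varepsilon} dx_1\int_{x_1}^{\varepsilon+1/n} dx_2\cdots\int_{x_{j-1}}^{\varepsilon+(j-1)/n} dx_j, \tag{1.105}$$

$$J_j := \int_{\varepsilon+(j/n)}^{1} (1 - x)^{n-j-1}\,dx = \Big(1 - \varepsilon - \frac{j}{n}\Big)^{n-j}\Big/(n - j), \tag{1.106}$$

and a $(j + 1)$-fold integral equals the given product because $x_j \le \varepsilon + (j - 1)/n$ and $x_{j+1} \ge \varepsilon + j/n$ imply $x_j < x_{j+1}$. Also, $(j/n) + \varepsilon \le 1$ follows from $j \le J \le n - \lambda\sqrt{n}$. So, to prove Proposition 1.35 it remains to show that

$$I_j = \varepsilon\Big(\frac{j}{n} + \varepsilon\Big)^{j-1}\Big/j!. \tag{1.107}$$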
n 


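This will be proved for each $\eta$ with $0 \le \eta \le 1 - \frac{j-1}{n}$ in place of $\varepsilon$ and by induction on $j$. Equation (1.107) holds for $j = 1$. Assume it holds for $j - 1$ for some $j$ with $2 \le j \le J$, and for each $\eta$ with $0 \le \eta \le 1 - (j - 2)/n$ in place of $\varepsilon$. In the integral (1.105) make the changes of variables $\xi_i = x_i - x_1$ for $i = 2, \ldots, j$ and let $\delta := \varepsilon - x_1 + \frac{1}{n}$. Then

$$I_j = \int_0^{\varepsilon} dx_1\int_0^{\delta} d\xi_2\int_{\xi_2}^{\delta+1/n} d\xi_3\cdots\int_{\xi_{j-1}}^{\delta+(j-2)/n} d\xi_j. \tag{1.108}$$

Applying the induction hypothesis to the inner $(j - 1)$-fold integral in (1.108) we get

$$I_j = \int_0^{\varepsilon} dx_1\,\delta\Big(\frac{j-1}{n} + \delta\Big)^{j-2}\Big/(j - 1)! = \int_0^{\varepsilon} dx_1\Big(\varepsilon - x_1 + \frac{1}{n}\Big)\Big(\varepsilon - x_1 + \frac{j}{n}\Big)^{j-2}\Big/(j - 1)!,$$

so setting $y := \varepsilon - x_1$ gives

$$(j - 1)!\,I_j = \int_0^{\varepsilon}\Big(y + \frac{1}{n}\Big)\Big(y + \frac{j}{n}\Big)^{j-2} dy.$$

An integration by parts then gives that $(j - 1)!\,I_j$ equals

$$\frac{1}{j-1}\Big[\Big(y + \frac{1}{n}\Big)\Big(y + \frac{j}{n}\Big)^{j-1}\Big]_0^{\varepsilon} - \frac{1}{j-1}\int_0^{\varepsilon}\Big(y + \frac{j}{n}\Big)^{j-1} dy$$

$$= \frac{1}{j-1}\Big[\Big(\varepsilon + \frac{1}{n}\Big)\Big(\varepsilon + \frac{j}{n}\Big)^{j-1} - \frac{1}{n}\Big(\frac{j}{n}\Big)^{j-1} - \frac{1}{j}\Big(\Big(\varepsilon + \frac{j}{n}\Big)^{j} - \Big(\frac{j}{n}\Big)^{j}\Big)\Big]$$

$$= \frac{1}{j-1}\Big(\varepsilon + \frac{j}{n}\Big)^{j-1}\Big[\varepsilon + \frac{1}{n} - \frac{\varepsilon}{j} - \frac{1}{n}\Big] = \varepsilon\Big(\varepsilon + \frac{j}{n}\Big)^{j-1}\Big/j,$$

which proves (1.107) and thus Proposition 1.35.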


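Next, for a Brownian bridge $y = \{y_t\}_{0\le t\le 1}$, and $\lambda > 0$, let $\tau_\lambda := \inf\{s > 0 : y_s \ge \lambda\}$ if this is less than 1, otherwise let $\tau_\lambda = 2$. Then for $0 < s \le 1$,

$$\Pr(\tau_\lambda \le s) = 1 - \Phi\Big(\frac{\lambda}{\sqrt{s(1-s)}}\Big) + \exp(-2\lambda^2)\Big(1 - \Phi\Big(\frac{(1-2s)\lambda}{\sqrt{s(1-s)}}\Big)\Big)$$

by Lemma 1.30. Let $f_\lambda(s) := d\Pr(\tau_\lambda \le s)/ds$. Then for the standard normal density $\phi$,

$$f_\lambda(s) = -\lambda\,\phi\Big(\frac{\lambda}{\sqrt{s(1-s)}}\Big)A(s) - \lambda e^{-2\lambda^2}\phi\Big(\frac{(1-2s)\lambda}{\sqrt{s(1-s)}}\Big)B(s) \tag{1.109}$$

where

$$A(s) = \frac{-(1-s)+s}{2s^{3/2}(1-s)^{3/2}} = \frac{2s-1}{2s^{3/2}(1-s)^{3/2}},$$

$$B(s) = -\frac{2}{\sqrt{s(1-s)}} - \frac{(1-2s)^2}{2s^{3/2}(1-s)^{3/2}} = \frac{-4s+4s^2-(1-2s)^2}{2s^{3/2}(1-s)^{3/2}} = \frac{-1}{2s^{3/2}(1-s)^{3/2}}.$$

Next,

$$2\lambda^2 + \frac{(1-2s)^2\lambda^2}{2s(1-s)} = \frac{\lambda^2}{2s(1-s)}.$$

Combining terms gives a formula which Massart attributes to E. Csáki (1974),

$$f_\lambda(s) = \frac{\lambda}{\sqrt{2\pi}}\exp\Big(-\frac{\lambda^2}{2s(1-s)}\Big)\frac{2-2s}{2s^{3/2}(1-s)^{3/2}} = \frac{\lambda}{\sqrt{2\pi}\,s^{3/2}\sqrt{1-s}}\exp\Big(-\frac{\lambda^2}{2s(1-s)}\Big). \tag{1.110}$$

From the definitions we have for each $\lambda > 0$

$$\Pr(D_n^- \ge \lambda) = \sum_{0\le j\le n-\lambda\sqrt{n}} p_{\lambda,n}(j) \tag{1.111}$$

and using Theorem 1.5,

$$\exp(-2\lambda^2) = \Pr(\tau_\lambda \le 1) = \int_0^1 f_\lambda(s)\,ds. \tag{1.112}$$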
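The identity (1.112) lends itself to a quick numerical check. The sketch below assumes the density $f_\lambda(s) = \lambda(2\pi)^{-1/2}s^{-3/2}(1-s)^{-1/2}\exp(-\lambda^2/(2s(1-s)))$ from (1.110) and integrates it with a plain midpoint rule; the function names and the number of cells are arbitrary choices.

```python
import math


def f_lambda(s: float, lam: float) -> float:
    """Density of the first passage time of a Brownian bridge to level lam, 0 < s < 1."""
    if s <= 0.0 or s >= 1.0:
        return 0.0
    return (lam / math.sqrt(2.0 * math.pi)) * s ** -1.5 * (1.0 - s) ** -0.5 \
        * math.exp(-lam * lam / (2.0 * s * (1.0 - s)))


def integral_of_density(lam: float, cells: int = 20000) -> float:
    """Midpoint-rule approximation of the integral of f_lambda over [0, 1]."""
    width = 1.0 / cells
    return width * sum(f_lambda((i + 0.5) * width, lam) for i in range(cells))
```

The midpoint rule is adequate here because the exponential factor makes $f_\lambda$ and all its derivatives vanish at both endpoints, so the integrand is effectively smooth on the closed interval.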


The next fact is one of the main steps in the proof of Theorem 1.34. 


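Proposition 1.36 Let $j$ be a nonnegative integer with $j \le n - \lambda\sqrt{n}$. Let $s = (2\varepsilon/3) + j/n$, $s' = 1 - s$, and $v_n(s) = 1/(s(s^2 - 1/(4n^2)))$. If $n\varepsilon \ge 2$, then

$$p_{\lambda,n}(j) \le \frac{1}{n}\Big(1 - \frac{\varepsilon}{3s'} + \frac{\varepsilon^2}{6s'^2}\Big)E_{n,\lambda,s} \tag{1.113}$$

where

$$E_{n,\lambda,s} = \exp\Big(\frac{0.4}{ns} - \frac{\varepsilon^2}{24n}\big(v_n(s) + v_n(s')\big)\Big)f_\lambda(s). \tag{1.114}$$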


Some lemmas and other facts will be used to prove Proposition 1.36. The first 
one has implications for the binomial distribution. 


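Lemma 1.37 Let $0 < \varepsilon < q = 1 - p < 1$. Let

$$h(p, \varepsilon) = (p + \varepsilon)\log\Big(\frac{p+\varepsilon}{p}\Big) + (q - \varepsilon)\log\Big(\frac{q-\varepsilon}{q}\Big).$$

For $t \ge 0$ let

$$g(t) = t - \log(1 + t) - \frac{t^2}{2(1 + 2t/3)}.$$

Then
(i) $g$ is a strictly increasing convex function with $g(t)/t \to 1/4$ as $t \to +\infty$,
(ii) For $t := \varepsilon/(q - \varepsilon)$,

$$h(p, \varepsilon) \ge \frac{\varepsilon^2}{2(p + \varepsilon/3)(q - \varepsilon/3)} + \varepsilon g(t)/t.$$

Proof. For (i), we have $g(0) = 0$, and for all $t > 0$

$$g'(t) = (t^3/9)(1 + 2t/3)^{-2}(1 + t)^{-1} > 0.$$

To see that $g'$ is increasing, note that $(3 + 2t)^2(1 + t)/t^3$ is decreasing. So $g$ is convex. As $t \to +\infty$, $g(t)/t \to 1 - 1/(4/3) = 1/4$ as stated.

For (ii), clearly

$$(q - \varepsilon)\log((q - \varepsilon)/q) = -(\varepsilon/t)\log(1 + t).$$

Let $x := p + \varepsilon$. Then $0 < \varepsilon < x$. We need to show

$$x\log x - x\log(x - \varepsilon) \ge \frac{\varepsilon^2}{2(p + \varepsilon/3)(q - \varepsilon/3)} + \varepsilon - \frac{\varepsilon^2/(q - \varepsilon)}{2(1 + 2\varepsilon/(3(q - \varepsilon)))}$$

$$= \frac{\varepsilon^2}{2\big(x - \frac{2\varepsilon}{3}\big)\big(1 - x + \frac{2\varepsilon}{3}\big)} + \varepsilon - \frac{\varepsilon^2/(q - \varepsilon)}{2(1 + 2\varepsilon/(3(q - \varepsilon)))}.$$

The last term equals $-\varepsilon^2/(2(1 - x + 2\varepsilon/3))$, and so we need to show

$$x\log(x) - x\log(x - \varepsilon) - \frac{\varepsilon^2}{2(x - 2\varepsilon/3)} - \varepsilon \ge 0.$$

The left side of the last inequality is 0 when $\varepsilon = 0$. Its derivative with respect to $\varepsilon$ is

$$(\varepsilon^3/9)(x - \varepsilon)^{-1}(x - 2\varepsilon/3)^{-2} > 0.$$

Thus (ii) holds.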


A consequence for the binomial distribution is: 


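Theorem 1.38 Let $S_n$ be a binomial $(n, p)$ random variable and suppose that $0 < \varepsilon < q = 1 - p$. Then

$$\Pr(S_n - np \ge n\varepsilon) \le \exp\Big(-\frac{n\varepsilon^2}{2(p + \varepsilon/3)(q - \varepsilon/3)}\Big).$$

Proof. The probability is less than or equal to

$$\Big(\Big(\frac{p}{p+\varepsilon}\Big)^{p+\varepsilon}\Big(\frac{q}{q-\varepsilon}\Big)^{q-\varepsilon}\Big)^{n} = \exp(-nh(p, \varepsilon))$$

by Chernoff's inequality, which follows from one of Hoeffding's inequalities, Theorem 1.13. Then we can apply Lemma 1.37(ii).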
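Both Lemma 1.37(ii) and Theorem 1.38 can be compared with exact quantities. The sketch below assumes $g(t) = t - \log(1+t) - t^2/(2(1+2t/3))$ and $h(p, \varepsilon)$ as in Lemma 1.37; the function names are hypothetical, and the threshold is passed as an integer count $k$ so that $\varepsilon = k/n - p$.

```python
import math


def g(t: float) -> float:
    """g(t) = t - log(1 + t) - t^2 / (2 (1 + 2t/3)) for t >= 0."""
    return t - math.log1p(t) - t * t / (2.0 * (1.0 + 2.0 * t / 3.0))


def h_rel(p: float, eps: float) -> float:
    """h(p, eps) = (p+eps) log((p+eps)/p) + (q-eps) log((q-eps)/q), 0 < eps < q."""
    q = 1.0 - p
    return (p + eps) * math.log((p + eps) / p) + (q - eps) * math.log((q - eps) / q)


def lemma_lower_bound(p: float, eps: float) -> float:
    """Right side of Lemma 1.37(ii) with t = eps / (q - eps)."""
    q = 1.0 - p
    t = eps / (q - eps)
    return eps * eps / (2.0 * (p + eps / 3.0) * (q - eps / 3.0)) + eps * g(t) / t


def exact_upper_tail(n: int, p: float, k: int) -> float:
    """Pr(S_n >= k) for S_n binomial (n, p)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def theorem_bound(n: int, p: float, k: int) -> float:
    """Bound of Theorem 1.38 for Pr(S_n >= k), with eps = k/n - p in (0, q)."""
    eps = k / n - p
    return math.exp(-n * eps * eps / (2.0 * (p + eps / 3.0) * (1.0 - p - eps / 3.0)))
```

For example, with $n = 40$ and $p = 0.2$, `exact_upper_tail(40, 0.2, k)` stays below `theorem_bound(40, 0.2, k)` for every $k$ from 9 to 39.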


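Now, to begin the proof of Proposition 1.36 for $j \ge 1$, recall Theorem 1.17 giving Stirling's formula with error bounds. From it, for $1 \le j < n$,

$$\binom{n}{j} \le \frac{1}{\sqrt{2\pi}}\Big(\frac{n}{j(n-j)}\Big)^{1/2}\Big(\frac{n}{j}\Big)^{j}\Big(\frac{n}{n-j}\Big)^{n-j} C_j$$

where $C_j := \exp(-1/(12j + 1))$, using that $\frac{1}{12n} - \frac{1}{12(n-j)+1} < 0$. Then by (1.104),

$$p_{\lambda,n}(j) \le \frac{\lambda}{\sqrt{2\pi}}\cdot\frac{1}{\sqrt{j(n-j)}\,\big(\frac{j}{n} + \lambda/\sqrt{n}\big)}\Big(\frac{j + \lambda\sqrt{n}}{j}\Big)^{j}\Big(\frac{n - j - \lambda\sqrt{n}}{n-j}\Big)^{n-j} C_j.$$

Recalling $h(\cdot, \cdot)$ as in Lemma 1.37 and that $s = 2\varepsilon/3 + j/n$ and $s' = 1 - s$, it follows that

$$p_{\lambda,n}(j) \le \frac{\lambda}{n\sqrt{2\pi}}\Big(s - \frac{2\varepsilon}{3}\Big)^{-1/2}\Big(s' + \frac{2\varepsilon}{3}\Big)^{-1/2}\Big(s + \frac{\varepsilon}{3}\Big)^{-1}\exp\Big(-nh\Big(s' - \frac{\varepsilon}{3}, \varepsilon\Big)\Big)C_j.$$

Define $\psi(t)$ for $0 \le t < \infty$ by

$$\psi(t) := -\log(1 + t) + \frac{3}{2}\log\Big(1 + \frac{2t}{3}\Big). \tag{1.115}$$

Setting $t = \varepsilon/(s - 2\varepsilon/3)$, which agrees with the definition of $t$ in Lemma 1.37(ii), that Lemma gives

$$p_{\lambda,n}(j) \le \frac{C_j}{n}\Big(1 + \frac{2\varepsilon}{3s'}\Big)^{-1/2}\frac{s^{3/2}}{\sqrt{s - 2\varepsilon/3}\,(s + \varepsilon/3)}\exp\Big(-\frac{n\varepsilon g(t)}{t}\Big)f_\lambda(s),$$

or equivalently

$$p_{\lambda,n}(j) \le \frac{C_j}{n}\Big(1 + \frac{2\varepsilon}{3s'}\Big)^{-1/2}\exp\Big(-\frac{n\varepsilon g(t)}{t} + \psi(t)\Big)f_\lambda(s). \tag{1.116}$$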

To bound the exponentiated term, we have the following. 


Lemma 1.39 Let $\theta := 0.4833$. Let $g$ and $\psi$ be the functions defined in Lemma 1.37 and (1.115) respectively. For $t \ge 0$ and $v > 0$ let

$$T(v, t) := v^2 g(t) - vt\psi(t) + \frac{\theta t^2}{1 + 2t/3}.$$

Then $T(v, t) \ge 0$ for $0 \le t \le v$.

Proof. (a) First suppose $v = t > 0$. Then, straightforwardly we have

$$\frac{d}{dt}\Big(\frac{T(t, t)}{t^2}\Big) = \frac{-2\theta/3 - t/3 + t^2/9}{(1 + 2t/3)^2}. \tag{1.117}$$

The quadratic in the numerator has two roots, one being negative, and a positive root $t_0 = (3 + \sqrt{9 + 24\theta})/2$. The derivative in (1.117) equals $-2\theta/3 < 0$ when $t = 0$, so $t_0$ is a relative minimum of $T(t, t)/t^2$ and is the absolute minimum for $t > 0$. We find $T(t_0, t_0)/t_0^2 \doteq 5.05 \cdot 10^{-6} > 5 \cdot 10^{-6} > 0$, so $T(t, t) > 0$ for all $t > 0$.

(b) Now for general $0 < t \le v$, $T$ is a quadratic polynomial in $v$ for fixed $t$. Its derivative with respect to $v$ is $2vg(t) - t\psi(t)$. If

$$2g(t) - \psi(t) > 0 \tag{1.118}$$

then $T(v, t)$ is increasing in $v$ for all $v \ge t$ and so $T(v, t) > 0$. Or, if the discriminant $D(t)$ of the quadratic polynomial satisfies

$$D(t) = t^2\psi^2(t) - 4g(t)\theta t^2/(1 + 2t/3) < 0,$$

or equivalently

$$\Delta(t) := (1 + 2t/3)\psi^2(t) - 4\theta g(t) < 0, \tag{1.119}$$

then $T(v, t)$ remains positive for all $v \ge t$ as it is for $v = t$ and has no roots. So to prove Lemma 1.39 it will suffice to show that

(i) $2g(t) - \psi(t) > 0$ for all $t \ge 3.37$.
(ii) $\Delta(t) < 0$ for $0 < t \le 3.37$.

Proof of (i). We have

$$\frac{d}{dt}\big(2g - \psi\big)(t) = \frac{t}{9}(2t^2 - 2t - 3)(1 + t)^{-1}\Big(1 + \frac{2t}{3}\Big)^{-2}.$$

We have $2t^2 - 2t - 3 > 0$ for $t > (1 + \sqrt{7})/2$; thus $2g - \psi$ is increasing for such $t$. Since $(1 + \sqrt{7})/2 \doteq 1.823 < 3.37$ we have that for $t \ge 3.37$,

$$2g(t) - \psi(t) \ge 2g(3.37) - \psi(3.37) \doteq 0.000775 > 7 \cdot 10^{-4} > 0,$$

proving (i).

For (ii), Massart states that the function $R(t) = \big(1 + \frac{2t}{3}\big)\psi^2(t)/g(t)$ is increasing for $t > 0$ (which is needed only for $t \le 3.37$). The proof given in the first half of p. 1275 of his paper is not correct, as the function $R_0 = \psi^2/g$ is not increasing. Nevertheless it appears true that $R(t)$ is increasing by examining computed values of it on a grid, $0, 0.01, 0.02, \ldots, 3.37$. Since $R(3.37) < 1.89 < 4\theta = 1.9332$, it follows that (ii) holds.
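The numerical values quoted in this proof can be recomputed directly. The sketch below assumes $g$ as in Lemma 1.37, $\psi(t) = \frac{3}{2}\log(1 + 2t/3) - \log(1 + t)$ as in (1.115), and $\theta = 0.4833$; the function names are hypothetical, and the grid check mirrors the grid $0.01, 0.02, \ldots, 3.37$ mentioned in the text rather than replacing an analytic argument.

```python
import math

THETA = 0.4833


def g(t: float) -> float:
    """g(t) = t - log(1 + t) - t^2 / (2 (1 + 2t/3))."""
    return t - math.log1p(t) - t * t / (2.0 * (1.0 + 2.0 * t / 3.0))


def psi(t: float) -> float:
    """psi(t) = (3/2) log(1 + 2t/3) - log(1 + t)."""
    return 1.5 * math.log1p(2.0 * t / 3.0) - math.log1p(t)


def T(v: float, t: float) -> float:
    """T(v, t) = v^2 g(t) - v t psi(t) + theta t^2 / (1 + 2t/3)."""
    return v * v * g(t) - v * t * psi(t) + THETA * t * t / (1.0 + 2.0 * t / 3.0)


def delta(t: float) -> float:
    """Delta(t) = (1 + 2t/3) psi(t)^2 - 4 theta g(t), as in (1.119)."""
    return (1.0 + 2.0 * t / 3.0) * psi(t) ** 2 - 4.0 * THETA * g(t)


def minimum_on_diagonal() -> float:
    """T(t0, t0) / t0^2 at the critical point t0 = (3 + sqrt(9 + 24 theta)) / 2."""
    t0 = (3.0 + math.sqrt(9.0 + 24.0 * THETA)) / 2.0
    return T(t0, t0) / (t0 * t0)
```

`minimum_on_diagonal()` is a small positive number of order $10^{-6}$, `2 * g(3.37) - psi(3.37)` is about $7.75 \cdot 10^{-4}$, and `delta` is negative at every grid point up to 3.37, which shows how little slack the choice $\theta = 0.4833$ leaves.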


To continue the proof of Proposition 1.36, three Claims will be stated and 
proved. 


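Claim 1. Let $\beta := 0.826$. Then for any $x \in [0, 1]$,

$$(1 + 2x)^{-1/2} \le \Big(1 - x + \frac{3x^2}{2}\Big)\exp(-\beta x^3).$$

Proof of Claim 1. Let $\alpha = 135/32$ and $\rho(y) := 1 + \alpha y - \exp(2\beta y)$ for $0 \le y \le 1$. Then $\rho''(y) < 0$, so $\rho$ is concave. It satisfies $\rho(0) = 0$ and $\rho(1) > 0.00134 > 0$. So $\rho(y) \ge 0$ for $0 \le y \le 1$. It follows that for $0 \le x \le 1$,

$$\exp(2\beta x^3) \le 1 + \alpha x^3 \le 1 + \alpha x^3 + \frac{1}{32}x^3(5 - 12x)^2 = (1 - x + 1.5x^2)^2(1 + 2x).$$

Taking the square root of both sides, Claim 1 follows.

Claim 2. For $\varepsilon = \lambda/\sqrt{n}$ as usual and $j$, $s$, $s'$, and $v_n(\cdot)$ as defined in Proposition 1.36, if $n\varepsilon \ge 2$ and $ns' \ge 1$, we have

$$\Big(1 + \frac{2\varepsilon}{3s'}\Big)^{-1/2} \le \Big(1 - \frac{\varepsilon}{3s'} + \frac{\varepsilon^2}{6s'^2}\Big)\exp\Big(-\frac{\varepsilon^2 v_n(s')}{24n}\Big). \tag{1.120}$$

Proof of Claim 2. First it will be checked that $\varepsilon/(3s') \le 1$, or equivalently $\varepsilon \le 3s' = 3 - 2\varepsilon - 3j/n$, or $\varepsilon = \lambda/\sqrt{n} \le 1 - j/n$, which is true since $j \le n - \lambda\sqrt{n}$ by (1.103). So we can apply Claim 1 with $x = \varepsilon/(3s')$. To prove (1.120) we need to show that

$$\beta\varepsilon^3/(27s'^3) \ge \varepsilon^2 v_n(s')/(24n),$$

or equivalently

$$8\beta/9 \ge \big(n\varepsilon(1 - (2ns')^{-2})\big)^{-1}.$$

Since $n\varepsilon \ge 2$ and $ns' \ge 1$, we have

$$\big(n\varepsilon(1 - (2ns')^{-2})\big)^{-1} \le \frac{1}{2 \cdot (3/4)} = \frac{2}{3} < \frac{8\beta}{9},$$

as $8\beta/9 \doteq 0.73422$. So Claim 2 is proved.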


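Claim 3. For $v_n$ as defined in Proposition 1.36, and any $\varepsilon > 0$ and $s > 0$ satisfying $1/n \le \varepsilon \le 3s/2$, we have

$$\Big(1 + 12n\Big(s - \frac{2\varepsilon}{3}\Big)\Big)^{-1} \ge \frac{1}{12ns} + \frac{\varepsilon^2 v_n(s)}{24n}. \tag{1.121}$$

Proof of Claim 3. We have identically

$$\Big(1 + 12n\Big(s - \frac{2\varepsilon}{3}\Big)\Big)^{-1} - \frac{1}{12ns} = (8n\varepsilon - 1)(12ns)^{-1}\Big(1 + 12n\Big(s - \frac{2\varepsilon}{3}\Big)\Big)^{-1}.$$

So (1.121) reduces to

$$\frac{8n\varepsilon - 1}{1 + 12n(s - 2\varepsilon/3)} \ge \frac{\varepsilon^2}{2}\Big/\Big(s^2 - \frac{1}{4n^2}\Big),$$

or equivalently

$$2(8n\varepsilon - 1)s^2 - 12n\varepsilon^2 s + (8n\varepsilon - 1)\Big(\varepsilon^2 - \frac{1}{2n^2}\Big) \ge 0.$$

The derivative of the left side with respect to $s$ is $4(8n\varepsilon - 1)s - 12n\varepsilon^2$, which is positive for $s \ge 2\varepsilon/3$, so the left side is increasing and it suffices to check the inequality for $s = 2\varepsilon/3$. So we want to show that

$$(8n\varepsilon - 1)(34n^2\varepsilon^2 - 9)/(18n^2) \ge 8n\varepsilon^3,$$

for which, since $n\varepsilon \ge 1$, it suffices to show that $(7n\varepsilon)(25n^2\varepsilon^2) \ge 144n^3\varepsilon^3$, which is true. So (1.121) holds and Claim 3 is proved.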


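Now to finish the proof of Proposition 1.36 for $j \ge 1$, recall (1.116), in which $t = \varepsilon/\big(s - \frac{2\varepsilon}{3}\big)$, implying $t = n\varepsilon/j \le n\varepsilon$. Lemma 1.39 with $v = n\varepsilon$ gives

$$p_{\lambda,n}(j) \le \frac{C_j}{n}\Big(1 + \frac{2\varepsilon}{3s'}\Big)^{-1/2}\exp\Big(\frac{\theta}{ns}\Big)f_\lambda(s).$$

Claim 2 gives an upper bound for $\big(1 + \frac{2\varepsilon}{3s'}\big)^{-1/2}$. By Claim 3, as $s - (2\varepsilon)/3 = j/n$, we have

$$\log(C_j) = -\frac{1}{12j + 1} \le -\frac{1}{12ns} - \frac{\varepsilon^2 v_n(s)}{24n}.$$

Noting that $\theta - 1/12 < 0.4$ finishes the proof for $j \ge 1$.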


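Proof of Proposition 1.36 for $j = 0$. In this case $s = 2\varepsilon/3$. We have $p_{\lambda,n}(0) = (1 - \varepsilon)^n$ by Proposition 1.35. One can naturally define $h(p, \delta)$ when $\delta = q$ as $-\log(p)$ since for fixed $q$ with $0 < q < 1$, $(q - \delta)\log((q - \delta)/q) \to 0$ as $\delta \uparrow q$. Then $(1 - \varepsilon)^n = \exp(-nh(1 - \varepsilon, \varepsilon))$. To apply Lemma 1.37, we will have $t = \varepsilon/(q - \varepsilon)$, which will be defined as $+\infty$ in this case with $q = \varepsilon$; for $t \to +\infty$ the limit of $g(t)/t$ is $1/4$. Then Lemma 1.37 gives

$$p_{\lambda,n}(0) = (1 - \varepsilon)^n = \exp(-nh(1 - \varepsilon, \varepsilon)) \le \exp\Big(-\frac{\lambda^2}{2ss'} - \frac{n\varepsilon}{4}\Big).$$

Define $H(v)$ for $v > 0$ by

$$H(v) := \frac{3\log(3/2)}{2} - \frac{\log(2\pi)}{2} + \frac{v}{4} + \frac{0.4}{v} - \frac{\log(v)}{2}.$$

Then it is straightforward to check that

$$\exp\Big(-\frac{\lambda^2}{2ss'} - \frac{n\varepsilon}{4}\Big) = \frac{\lambda}{n\sqrt{2\pi}}\Big(\frac{3}{2\varepsilon}\Big)^{3/2}\exp\Big(\frac{0.4}{n\varepsilon}\Big)\exp\Big(-\frac{\lambda^2}{2ss'}\Big)\exp(-H(n\varepsilon)).$$

We have $H'(v) = \frac{1}{4} - \frac{0.4}{v^2} - \frac{1}{2v} = 0$ if and only if $v^2 - 2v - 1.6 = 0$. The only positive root of this is at $v_0 = 1 + \sqrt{2.6}$. This is the minimum of $H$ for $v > 0$ because $H(v) \to +\infty$ as $v \to +\infty$. Thus $H(v) \ge H(v_0) \ge 0.01534 > 0$ for all $v > 0$. It follows that

$$p_{\lambda,n}(0) \le \frac{\lambda}{n\sqrt{2\pi}}\Big(\frac{3}{2\varepsilon}\Big)^{3/2}\exp\Big(\frac{0.4}{n\varepsilon}\Big)\exp\Big(-\frac{\lambda^2}{2ss'}\Big) = \frac{1}{n}\Big(1 + \frac{2\varepsilon}{3s'}\Big)^{-1/2}\exp\Big(\frac{0.4}{n\varepsilon}\Big)f_\lambda(s),$$

where the second equation follows on expanding $f_\lambda(s)$ by (1.110). Since $n\varepsilon \ge 2$, it follows that

$$\frac{\varepsilon^2 v_n(s)}{24n} = \Big(n\varepsilon\Big(\frac{64}{9} - \frac{4}{n^2\varepsilon^2}\Big)\Big)^{-1} \le 9/(55n\varepsilon),$$

from which it follows that

$$\frac{0.4}{n\varepsilon} + \frac{\varepsilon^2 v_n(s)}{24n} \le \frac{0.4 + 9/55}{n\varepsilon} < \frac{0.4}{ns}.$$

Combining gives

$$p_{\lambda,n}(0) \le \frac{1}{n}\Big(1 + \frac{2\varepsilon}{3s'}\Big)^{-1/2}\exp\Big(\frac{0.4}{ns} - \frac{\varepsilon^2 v_n(s)}{24n}\Big)f_\lambda(s).$$

Via Claim 2, (1.113) follows, and Proposition 1.36 is proved.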


In the proof of Theorem 1.34, the integral in (1.112) will be compared to 
Riemann sums and thereby to the sums in (1.111). The comparison will use the 
next lemma. 


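Lemma 1.40 Let $0 < \delta < s < 1 - \delta$ and $s' = 1 - s$. If $G$ is a continuous function with $G(x) > 0$ for $s - \delta \le x \le s + \delta$ and $\log(G)$ is convex, then for any $\lambda > 0$,

$$\frac{1}{2\delta}\int_{s-\delta}^{s+\delta} G(u)\exp\left(-\frac{\lambda^2}{2u(1-u)}\right)du \ge G(s)\exp\left(-\frac{\lambda^2}{2ss'}\right)\exp\left(-\frac{\lambda^2\delta^2}{6}\left(\frac{1}{s(s^2-\delta^2)} + \frac{1}{s'(s'^2-\delta^2)}\right)\right).$$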


Proof. Jensen’s inequality, e.g., RAP, 10.2.6, $\eta(EX) \le E\eta(X)$ for $\eta(\cdot)$ convex and $X$ with finite mean, will be applied twice. Both times the probability measure is the uniform distribution $U[s-\delta, s+\delta]$. First, the convex function is exp, then, second, it is $\log(G)$. We get

$$\frac{1}{2\delta}\int_{s-\delta}^{s+\delta} G(u)\exp\left(-\frac{\lambda^2}{2u(1-u)}\right)du \ge \exp\left(\frac{1}{2\delta}\int_{s-\delta}^{s+\delta}\left(\log G(u) - \frac{\lambda^2}{2u(1-u)}\right)du\right)$$

$$\ge \exp\left(\log G(s) - \frac{\lambda^2}{4\delta}\int_{s-\delta}^{s+\delta}\left(\frac{1}{u} + \frac{1}{1-u}\right)du\right).$$


The function $u \mapsto 1/u$ has a positive fourth derivative. We can apply Simpson’s rule with remainder as given by Davis and Polonsky (1972, 25.4.5, p. 886): if $f$ has a continuous fourth derivative $f^{(4)}$, $h > 0$, $x_i = x_0 + ih$ for $i = 0, 1, 2$, and $f_i = f(x_i)$, then

$$\int_{x_0}^{x_2} f(x)\,dx = \frac{h}{3}\left[f_0 + 4f_1 + f_2\right] - \frac{h^5}{90}f^{(4)}(\xi)$$

for some $\xi \in [x_0, x_2]$. So if $f^{(4)} \ge 0$,

$$\int_{x_0}^{x_2} f(x)\,dx \le \frac{h}{3}\left[f_0 + 4f_1 + f_2\right]. \qquad (1.122)$$
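As a small numerical illustration of the one-sided Simpson bound for $f(u) = 1/u$, the sketch below compares the exact integral over $[s-\delta, s+\delta]$ with the single-panel Simpson value and with the closed form $2\delta\,(1/s + \delta^2/(3s(s^2-\delta^2)))$ that the Simpson value simplifies to. The helper names and the sample values of s and δ are illustrative assumptions.

```python
import math

def simpson_panel(f, x0: float, h: float) -> float:
    """Single-panel Simpson approximation of the integral of f over [x0, x0 + 2h]."""
    return h / 3 * (f(x0) + 4 * f(x0 + h) + f(x0 + 2 * h))

def compare(s: float, delta: float):
    """Return (exact integral of 1/u, Simpson value, closed form of the Simpson value)."""
    exact = math.log((s + delta) / (s - delta))
    approx = simpson_panel(lambda u: 1.0 / u, s - delta, delta)
    closed = 2 * delta * (1 / s + delta ** 2 / (3 * s * (s ** 2 - delta ** 2)))
    return exact, approx, closed

for s, delta in [(0.3, 0.1), (0.5, 0.01), (0.9, 0.05)]:
    exact, approx, closed = compare(s, delta)
    print(f"s={s}, delta={delta}: exact={exact:.10f} <= simpson={approx:.10f}")
```

Because the fourth derivative of 1/u is positive for u > 0, the Simpson value overestimates the integral, which is the direction needed in the proof.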

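Thus

$$\frac{1}{2\delta}\int_{s-\delta}^{s+\delta}\frac{du}{u} \le \frac{1}{6}\left(\frac{1}{s+\delta} + \frac{4}{s} + \frac{1}{s-\delta}\right) = \frac{1}{s} + \frac{\delta^2}{3s(s^2-\delta^2)}.$$

Next,

$$\frac{1}{2\delta}\int_{s-\delta}^{s+\delta}(1-u)^{-1}\,du = \frac{1}{2\delta}\int_{s'-\delta}^{s'+\delta}v^{-1}\,dv,$$

and the Lemma follows.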


Next are some identities for special integrals. 


Lemma 1.41 For any $a \ge 0$, $b \ge 0$, and $\lambda > 0$, let

$$I_{a,b}(\lambda) := \frac{\exp(2\lambda^2)}{\sqrt{2\pi}}\int_0^1 u^{-a-1/2}(1-u)^{-b-1/2}\exp\left(-\frac{\lambda^2}{2u(1-u)}\right)du.$$

Then the following hold:
(i) $\lambda I_{1,1}(\lambda)/2 = \lambda I_{1,0}(\lambda) = 1$;
(ii) $\lambda I_{2,2}(\lambda)/2 = \lambda I_{2,1}(\lambda) = 4 + \lambda^{-2}$;
(iii) $\lambda I_{2,0}(\lambda) = 2 + \lambda^{-2}$.


Proof. Clearly $I_{a,b} = I_{b,a}$. For any $u$ with $0 < u < 1$,

$$u^{-1/2-a}(1-u)^{-1/2-b} = u^{1/2-a}(1-u)^{-1/2-b} + u^{-1/2-a}(1-u)^{1/2-b},$$

which implies for any $a \ge 1$ and $b \ge 1$ that

$$I_{a,b} = I_{a-1,b} + I_{a,b-1}. \qquad (1.123)$$

For a = b = 1, using (1.112) without its middle term, we get (i). Next, differentiating with respect to $\lambda$ gives (ii). Then, applying (1.123) with a = 1 and b = 2, (iii) follows from (i) and (ii). So Lemma 1.41 is proved.
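The identities of Lemma 1.41 can be sanity-checked by numerical quadrature. The sketch below assumes the integrand $u^{-a-1/2}(1-u)^{-b-1/2}\exp(-\lambda^2/(2u(1-u)))$ with prefactor $\exp(2\lambda^2)/\sqrt{2\pi}$, the reading under which (i) to (iii) and the recursion (1.123) all hold; the midpoint rule and the number of nodes are arbitrary illustrative choices.

```python
import math

def I(a: float, b: float, lam: float, m: int = 20000) -> float:
    """Midpoint-rule approximation of I_{a,b}(lam) under the assumed integrand."""
    total = 0.0
    for k in range(m):
        u = (k + 0.5) / m
        w = u * (1.0 - u)
        # The exponential factor vanishes rapidly at both endpoints,
        # so the endpoint singularities of the power factors are harmless.
        total += u ** (-a - 0.5) * (1.0 - u) ** (-b - 0.5) * math.exp(-lam ** 2 / (2 * w))
    return math.exp(2 * lam ** 2) / math.sqrt(2 * math.pi) * total / m

lam = 1.0
print("(i)  ", lam * I(1, 1, lam) / 2, lam * I(1, 0, lam))
print("(ii) ", lam * I(2, 2, lam) / 2, lam * I(2, 1, lam), 4 + lam ** -2)
print("(iii)", lam * I(2, 0, lam), 2 + lam ** -2)
```

Such a check does not replace the proof, but it helps confirm the exponents and the prefactor when the formulas are transcribed.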


Now we can prove Theorem 1.34 under some conditions. 


Proof of Theorem 1.34 for $n \ge 39$ and $\lambda \le \sqrt{n}/2$. Since $n \ge 39$, the hypothesis on $\lambda$ in Theorem 1.34 becomes $\lambda \ge \zeta n^{-1/6}$. Thus $n\varepsilon = \lambda\sqrt{n} \ge 3.6764$. Define the function $\gamma(\cdot)$ by $\gamma(x) := (e^x - 1)/x$ for $x > 0$. One can easily see (e.g. from the Taylor series) that this function is increasing. Recalling that $s \ge 2\varepsilon/3 = 2\lambda/(3\sqrt{n})$ as in Proposition 1.36 we have


$$\exp\left(\frac{0.4}{ns}\right) \le 1 + \gamma\left(\frac{0.6}{3.6764}\right)\frac{0.4}{ns}.$$


Let $\mu := 0.4345$. Then $\exp(0.4/(ns)) \le 1 + \mu/(ns)$. Applying Proposition 1.36, we get another upper bound for $p_{\lambda,n}(j)$:


1 E€ e? H €7(Un(S) F Un(s’) 
n (: 35’ j =) (1 T =) exp ( 24n ) fils). 
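The constant 0.4345 used above can be checked with a few lines of arithmetic. The sketch assumes the reading $\gamma(x) = (e^x - 1)/x$, $\zeta = 1.0841$, $n\varepsilon \ge 3.6764$ and $ns \ge 2n\varepsilon/3$, so that $0.4/(ns) \le 0.6/3.6764$; variable names are illustrative.

```python
import math

def gamma(x: float) -> float:
    """gamma(x) = (e^x - 1)/x, increasing for x > 0."""
    return math.expm1(x) / x

zeta = 1.0841
n_eps_min = 3.6764                  # lower bound for n*eps when n >= 39
x_max = 0.4 / (2 * n_eps_min / 3)   # largest possible value of 0.4/(ns)
mu_needed = 0.4 * gamma(x_max)      # smallest constant mu that works

print(f"zeta * 39^(1/3) = {zeta * 39 ** (1 / 3):.6f}")
print(f"0.4 * gamma({x_max:.6f}) = {mu_needed:.6f}")
```

Since exp(x) = 1 + γ(x)x and γ is increasing, any x in (0, x_max] satisfies exp(x) ≤ 1 + γ(x_max)x, which is the inequality being used with x = 0.4/(ns).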


Preparing to apply Lemma 1.40, note that $z(\cdot)$ defined by $z(x) = \log(6x^2 - 2x + 1) - (5/2)\log(x)$ is convex for $x > 0$: calculation gives $z''(x)\cdot 2x^2(6x^2 - 2x + 1)^2 = h(x)$ where


$$h(x) = 36x^2(x-1)^2 + 60x^2 - 20x + 5 > 0.$$
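The polynomial identity behind the convexity of z can be verified with exact integer arithmetic. The sketch below writes $p(x) = 6x^2 - 2x + 1$, uses $z'' = (p''p - p'^2)/p^2 + 5/(2x^2)$, and compares $2x^2(p''p - p'^2) + 5p^2$ with the expanded form of h. The list-of-coefficients representation is an illustrative choice.

```python
def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists, constant term first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_add(p, q):
    """Add polynomials given as coefficient lists, constant term first."""
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0) for i in range(n)]

p = [1, -2, 6]      # 6x^2 - 2x + 1
dp = [-2, 12]       # p'
ddp = [12]          # p''

# Numerator of (log p)'': p'' p - (p')^2
num = poly_add(poly_mul(ddp, p), [-c for c in poly_mul(dp, dp)])

# z''(x) * 2x^2 p(x)^2 = 2x^2 (p'' p - p'^2) + 5 p^2
lhs = poly_add(poly_mul([0, 0, 2], num), [5 * c for c in poly_mul(p, p)])

# h(x) = 36x^2 (x - 1)^2 + 60x^2 - 20x + 5
rhs = poly_add(poly_mul([0, 0, 36], poly_mul([-1, 1], [-1, 1])), [5, -20, 60])

print(lhs, rhs)
```

Positivity of h then follows because $36x^2(x-1)^2 \ge 0$ and the quadratic $60x^2 - 20x + 5$ has negative discriminant.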


(Massart gives a different positive quartic polynomial, agreeing in the $x^4$, $x$, and constant terms but not the $x^3$ and $x^2$ terms.) Let for $0 < u < 1$
2 


= FL 4 *) 0 wy? - - 
G0 = Fe (14 Fra w(t aaa) 


Then for c := log(A/(6V 27x £), a constant with respect to u, we have 


1— 
log(G(u)) = c + log (1 + £) - 2 inetd +z (=) f 
nu 2 E 


in which each term is convex in u, so log G(-) is convex. So Lemma 1.40 with 
ô = 1/(2n) gives 


s+1/(2n) e g2 u 
n(J) < 1 1 — du. 
Pri do ( atay) + £) Aadu 
Summing over j in (1.111) and using also (1.112) we get, in the notation of
Lemma 1.41, recalling that $I_{a,b} = I_{b,a}$,


2 
E E 
exp(2A7) Pr(D7 > 2) < Io) — zA + Shi) 


2 
e E 
F t s FE A A) + P hÜ). 
n 3n 6n 


By Lemma 1.41 and simple calculations we then get 


3 
3 expa?) Pr(D7 > A) — 1) 
1 3u BL \ aa 
< MA): = -1l lat —4— 4 / 
< mA) thtt) 
-Efa tnt # fang t) ne. (1.124) 
2 22 2 À 
Remark. Smirnov (1944), as quoted by Massart, had given the asymptotic expansion

$$\Pr(D_n^- > \lambda) \sim \exp(-2\lambda^2)\left(1 - \frac{2\lambda}{3\sqrt{n}} + O\left(\frac{1}{n}\right)\right) \qquad (1.125)$$

if $\lambda = O(n^{1/6})$. By contrast, Massart’s inequality (1.124) is one-sided, but the first term −1 on the right confirms that the $-2\lambda/(3\sqrt{n})$ term in Smirnov’s expansion is valid non-asymptotically in a one-sided sense, which is what one wants.
To check that n, is convex in A for each n, note that 


3 1 3 
j ( Jae 53) 
has positive second derivative and all other individual terms are convex. Thus, to show that $\eta_n(\lambda) < 0$ for $\zeta n^{-1/6} \le \lambda \le \sqrt{n}/2$ it will suffice to show that $a_n := \eta_n(\zeta n^{-1/6}) < 0$ and $b_n := \eta_n(\sqrt{n}/2) < 0$.
It will be shown that $a_n$ and $b_n$ are decreasing in n for $n \ge 39$. We have


1/6 1 3 1/2 
na(n") = —1 + (a + 7 G + 3n) + ur) n"? 


an 


4 2 
1/3 1/6 
H n -1, -1/6 , ” —3/2 
3u 1 1 1 u = 
— 1 ae rm ie E 3 a En 2/3 
ey (=) (5+ n) + (¢ FE n 


—2un"! + ee 2c, 
As $\zeta = 1.0841$ (Theorem 1.34) and $\mu = 0.4345$, $\zeta - \mu/(2\zeta)^2 > 0$. Terms with positive coefficients and negative powers of n are decreasing. Just one term, $-2\mu n^{-1}$, is increasing. It will suffice to show that $(3/\zeta)n^{-1/3} - 2n^{-1}$ is decreasing for $n \ge 39$, or that $3x - 2.1682x^3$ is increasing for $0 < x \le 39^{-1/3}$. Indeed its derivative is positive there. A calculation shows that $b_n$ is a constant plus a sum of negative powers of n times positive coefficients, so it is also decreasing in n. We have $a_{39} = -0.006382 < -0.006 < 0$ and $b_{39} = -0.4238 < -0.4 < 0$, so both $a_n$ and $b_n$ are negative for all $n \ge 39$, so $\eta_n(\lambda) < 0$ for $\zeta n^{-1/6} \le \lambda \le \sqrt{n}/2$. So Theorem 1.34 is proved for $n \ge 39$ and $\lambda \le \sqrt{n}/2$.

For the rest of the proof of Theorem 1.34, let $C_{\lambda,n} := \exp(2\lambda^2)\Pr(D_n^- > \lambda)$.


Proposition 1.42 Letn > 2 and let à be such that 0 < à < ./n. Then 
(i) For à > J/n/2, Cin <9, 


(i) Dia Pa — a) 
(iii) For à > 1/2, we have £C, n < 3.61. 
Proof. (i) Let Li n(j) = log(exp(2A7) pa.n(j)) for 0 < j <n—AAJ/n. It will 


suffice by (1.111) to show that for each such j, dL} n/dà < 0 for à > /n/2. 
We have by (1.35) 


2. te ss 
dn G+ aa) G FADA] = Jn) 


Recall that € = A/./n. Set x = £ + j/n, noting that 0 < x < 1 by the restric- 
tion on j. Next, 


d ne?(1 — 2x}? — (x — e\(1— x) 

— Linj) = ; 1.126 

band) i (1.126) 
For j = 0 it follows easily that dL}, „(0)/dà < 0, so we can assume that j > 
1 and so x > £ + 1/n. To show that ne2(1 — 2x)? — (x — e)(1 — x) > 0, or 


equivalently 


s l 2 A2 
d + Že pat +5 tne? (20-142) ‘ 
n n n n n 
we can note that 4e? > 2e > 1, which implies ne?(4j?/n?) > j?/n. Since 
j/n < j?°/n, (i) holds. 

(ii) Clearly Pr(D, > à) is nonincreasing in à > 0. By (1.111) and (1.104) 
we have 


n—1 
d _ Jl „n—j nn n 
0> re > | =J/n 3 j (n— jn (‘) —1], 


so (ii) follows. 
(iii) First suppose $n\varepsilon \ge 2$ and $\varepsilon \le 1/2$. We use Proposition 1.36 and the bound $\exp(0.4/(ns)) \le \exp(0.3)$, and proceed in the same way as in the proof of Theorem 1.34 for $n \ge 39$ and $\varepsilon \le 1/2$, with $G(u)$ replaced by $G_1(u)$ defined to have a factor $e^{0.3}$ in place of $1 + \mu/(nu)$. Then $\log G_1(u)$ is convex and we get
2 
0.3 E E 
Crain Se (1.009 = z110) + Taw) ‘ 


Using Lemma 1.41 and 2/,/n < à < ./n/2, it follows that 


2X 1 
a < e3 o3] x -1/2 
Une 3Jn- ( ( 4a)” 


2X 3 
2 03 0.3 , 
e + 3a ( 5) 


Combining with (i) gives 
Cyn < exp(max((8/n), 0.3)) (1.127) 


for any integer n > 4 and any A > 0. 
An alternate bound for C} „ will be helpful for small values of n. For h as 
defined in Lemma 1.37, we have for j > 1 


+) n .j—1 — jy- jp-’” J 
paat) =È") (n= jn (=) 


xexp(=nh (1 =é= Le)); 
n 


By Lemma 1.37 we then get 


Pan) < aval)" — j) Jn exp(—2d”). 
J 
Thus by (ii), 
Cin < AVN + Pi.n(0) exp(22’). (1.128) 


It follows from (1.126) that for j = 0, dL, „(0)/dà <0, and for 1 < j < 
n—AdJ/n, dLa n(j)/dà < 1/à. Thus 


d 1 
C n < Cc n n 0 22 . 
doen = 3 ( A, Px,n(0) exp( )) 
Combining this last inequality with (1.127) if n > 14, or (1.128) for n < 13, 
we get 


d 
gO” < max(2 exp(4/7), v13) < 3.61, 


which proves (iii) and so Proposition 1.42. 
Proof of Theorem 1.34 for $n \le 38$ or $\lambda > \sqrt{n}/2$. By Proposition 1.42 and the first part of the proof of Theorem 1.34, we can assume that $n \le 38$. Then the assumption on $\lambda$ in Theorem 1.34 reduces to $\lambda > \sqrt{(\log 2)/2}$, which implies $\lambda > 1/2$.
Letting $\eta := 0.01$, let $A_{\eta,n} := \{\tfrac{1}{2} + k\eta : k \in \mathbb{N}\} \cap [\tfrac{1}{2}, \sqrt{n})$. A computer calculation, reported by Massart (1990), gave

$$\max_{n \le 38}\ \max_{\lambda \in A_{\eta,n}} C_{\lambda,n} < 0.951. \qquad (1.129)$$

In a confirming computation, the maximum was found to be ≈ 0.94955, attained at n = 38 and $\lambda = 1/2$. Combining (1.129) with Proposition 1.42(iii), we get

$$\max_{n \le 38}\ \sup_{1/2 \le \lambda < \sqrt{n}} C_{\lambda,n} \le 0.951 + 3.61\eta \le 0.9871 < 1,$$

which finishes the proof of Theorem 1.34 for $n \le 38$ and so completes its proof, and that of Theorem 1.32 by the Remarks after Theorem 1.34.
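The grid computation described above can be imitated as follows. This is only a sketch: it assumes that the finite sum (1.111) is the classical Birnbaum–Tingey/Smirnov formula $\Pr(D_n^- > \lambda) = \varepsilon\sum_{j=0}^{\lfloor n(1-\varepsilon)\rfloor}\binom{n}{j}(\varepsilon + j/n)^{j-1}(1 - \varepsilon - j/n)^{n-j}$ with $\varepsilon = \lambda/\sqrt{n}$, and it uses ordinary floating-point arithmetic rather than any error-controlled computation. Function names are illustrative.

```python
import math

def one_sided_tail(n: int, lam: float) -> float:
    """Pr(D_n^- > lam) via the assumed Birnbaum-Tingey/Smirnov finite sum."""
    eps = lam / math.sqrt(n)
    if eps >= 1.0:
        return 0.0
    total = 0.0
    for j in range(math.floor(n * (1.0 - eps)) + 1):
        x = eps + j / n
        total += math.comb(n, j) * x ** (j - 1) * (1.0 - x) ** (n - j)
    return eps * total

def C(n: int, lam: float) -> float:
    """C_{lam,n} = exp(2 lam^2) Pr(D_n^- > lam)."""
    return math.exp(2 * lam ** 2) * one_sided_tail(n, lam)

def grid_maximum(n_max: int = 38, eta: float = 0.01):
    """Maximum of C over n <= n_max and the grid 1/2 + k*eta in [1/2, sqrt(n))."""
    best = (0.0, 0, 0.0)
    for n in range(1, n_max + 1):
        k = 0
        while 0.5 + k * eta < math.sqrt(n):
            lam = 0.5 + k * eta
            best = max(best, (C(n, lam), n, lam))
            k += 1
    return best

value, n_at, lam_at = grid_maximum()
print(f"grid maximum {value:.5f} at n = {n_at}, lambda = {lam_at:.2f}")
```

Passing from the grid to all λ in [1/2, √n) then uses the derivative bound of Proposition 1.42(iii): between neighboring grid points C can grow by at most 3.61η.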


Problems 


1. Find the covariance matrix on {0, 1/4, 1/2, 3/4, 1} of:


(a) the Brownian bridge process $y_t$;


(b) $U_4 - U$. Hint: Recall that $n^{1/2}(U_n - U)$ has the same covariances as $y_t$.


2. Let $0 < t < u < 1$. Let $\alpha_n$ be the empirical process for the uniform distribution on [0, 1].


(a) Show that the distribution of $\alpha_n(t)$ is concentrated in some finite set $A_t$.


(b) Let $f(t, y, u) := E(\alpha_n(u) \mid \alpha_n(t) = y)$. Show that for any $y$ in $A_t$, $(u, f(t, y, u))$ is on the straight line segment joining $(t, y)$ to $(1, 0)$.


3. Let $(S, d)$ be a complete separable metric space. Let $\mu$ be a law on $S \times S$ and let $\delta > 0$ satisfy


$$\mu(\{(x, y): d(x, y) > 2\delta\}) \le 3\delta.$$


Let $\pi_2(x, y) := y$ and $P := \mu \circ \pi_2^{-1}$. Let $Q$ be a law on $S$ such that $\rho(P, Q) < \delta$ where $\rho$ is Prohorov’s metric. On $S \times S \times S$ let $\pi_{12}(x, y, z) := (x, y)$ and $\pi_3(x, y, z) := z$. Show that there exists a law $\alpha$ on $S \times S \times S$ such that $\alpha \circ \pi_{12}^{-1} = \mu$, $\alpha \circ \pi_3^{-1} = Q$, and


$$\alpha(\{(x, y, z): d(x, z) > 3\delta\}) \le 4\delta.$$


Hint: Use Strassen’s theorem, which implies that for some law $\nu$ on $S \times S$, if $\mathcal{L}(Y, Z) = \nu$, then $\mathcal{L}(Y) = P$, $\mathcal{L}(Z) = Q$, and $\nu(\{d(Y, Z) > \delta\}) < \delta$. Then the Vorob’ev–Berkes–Philipp theorem 1.31 applies.
4. Let $A = B = C = \{0, 1\}$. On $A \times B$ let

$$\mu := (\delta_{(0,0)} + 2\delta_{(1,0)} + 5\delta_{(0,1)} + \delta_{(1,1)})/9.$$

On $B \times C$ let $\nu := [\delta_{(0,0)} + \delta_{(1,0)} + \delta_{(0,1)} + 3\delta_{(1,1)}]/6$. Find a law $\gamma$ on $A \times B \times C$ such that if $\gamma = \mathcal{L}(X, Y, Z)$, then $\mathcal{L}(X, Y) = \mu$ and $\mathcal{L}(Y, Z) = \nu$.


5. Let $I = [0, 1]$ with usual metric $d$. For $\varepsilon > 0$, evaluate $D(\varepsilon, I, d)$, $N(\varepsilon, I, d)$ and $N(\varepsilon, I, I, d)$. Hint: The ceiling function $\lceil x\rceil$ is defined as the least integer $\ge x$. Answers can be written in terms of $\lceil\cdot\rceil$.


6. For a Poisson variable $X$ with parameter $\lambda > 0$, that is, $P(X = k) = e^{-\lambda}\lambda^k/k!$ for $k = 0, 1, 2, \ldots$, evaluate the moment generating function $Ee^{tX}$ for all $t$. For $M > \lambda$, find the bound for $\Pr(X \ge M)$ given by the moment generating function inequality (Theorem 1.10).


7. The bound $Ke^{-\lambda x}$ given by the right side of (1.1) is trivial if it is larger than 1. With $K$ and $\lambda$ as given in the Bretagnolle–Massart Theorem 1.8, how large must $x$ be for the bound to be nontrivial?


8. Based on Theorem 1.8 (Bretagnolle and Massart), for n = 225, find $b$ such that the empirical process $\alpha_n$ satisfies


$$\sup_{0 \le t \le 1} |(\alpha_n - Y_n)(t)| \le b$$


for a Brownian bridge $Y_n$ except with probability at most 0.05.


9. Kolmogorov’s test of the hypothesis $H_0$ that $X_1, \ldots, X_n$ are i.i.d. with a given fixed distribution function $F$ is to form the empirical distribution function $F_n$ based on $X_1, \ldots, X_n$ and the statistic $D_n := \sqrt{n}\,\sup_x |(F_n - F)(x)|$, rejecting the hypothesis at a level $\alpha > 0$ if, for the observed value $D_n^0$ of $D_n$, $P_0(D_n \ge D_n^0) \le \alpha$, where $P_0$ is the probability assuming $H_0$. Suppose $F$ is continuous. For $\alpha$ = 0.05, 0.01, and 0.001, find $M_\alpha$ such that $2\exp(-2M_\alpha^2) = \alpha$. Evaluate the expression (twice a sum) in the Remark after Theorem 1.32, which is the asymptotic distribution under $H_0$ for n large, for $u = M_\alpha$ for the three $\alpha$’s. Hint: In these cases the series can be written as a power series in $\alpha$. The series will converge fast, so not many terms need to be added.


Notes 


Notes to Section 1.1. The Glivenko—Cantelli Theorem 1.3 also appeared as 
Theorem 11.4.2 in RAP. RAP’s Notes to Section 11.4 say that Glivenko proved 
the (main) case where F is continuous. Francesco Paolo Cantelli (1875-1966) 
was a professor at the University of Rome from 1931 through 1951 and was 
chief editor of the Italian actuarial journal Giornale dell’Istituto Italiano degli
Attuari from 1930 through 1958. Both Glivenko’s and Cantelli’s contributions 
to their theorem were published in Italian in 1933 in that journal, as was the 
paper of Kolmogorov (1933b), with Kolmogorov’s paper immediately preced- 
ing Glivenko’s, and Cantelli’s some 300 pages later. I have not seen Glivenko’s 
or Cantelli’s papers in the original. Kolmogorov’s paper appeared in Russian 
translation in his selected works in 1986. 

The contributions of Doob (1949) and Donsker (1952) were mentioned in 
the text. 

When it was realized that the formulation by Donsker (1952) was incorrect because of measurability problems, Skorokhod (1956) (see also Kolmogorov (1956)) defined a separable metric d on the space D[0, 1] of right-continuous functions with left limits on [0, 1], such that convergence for d to a continuous function is equivalent to convergence for the sup norm, and the empirical process $\alpha_n$ converges in law in D[0, 1] to the Brownian bridge; see, for example, Billingsley (1968, Chapter 3). The formulation of Theorem 1.7 avoids the need for the Skorokhod topology and deals with measurability.

Comments on the Komlós–Major–Tusnády and Bretagnolle–Massart statements (Theorem 1.8) are given in the notes to Section 1.4.


Notes to Section 1.2. Apparently the first publication on ε-entropy was the
announcement by Kolmogorov (1955). Theorem 1.9, and the definitions of 
all the quantities in it, are given in the longer exposition by Kolmogorov and 
Tikhomirov (1959, Section 1, Theorem IV). 

Lorentz (1966) proposed the name “metric entropy” rather than “e-entropy,” 
urging that functions should not be named after their arguments, as functions of 
a complex variable z are not called “z-functions.” The name “metric entropy” 
emphasizes the purely metric nature of the concept. Actually, “e-entropy” 
has been used for different quantities. Posner, Rodemich, and Rumsey (1967, 1969) define an ε, δ entropy, for a metric space S with a probability measure P defined on it, in terms of a decomposition of S into sets of diameter at most ε and one set of probability at most δ. Also, Posner et al. define ε-entropy as the infimum of entropies $-\sum_i P(U_i)\log(P(U_i))$ where the $U_i$ have diameters at most ε. So Lorentz’s term “metric entropy” seems useful and has been adopted
here. 


Notes to Section 1.3. Sergei BernStein (1927, pp. 159-165) published his 
inequality. The proof given is based on Bennett (1962, p. 34) with some incor- 
rect, but unnecessary, steps (his (3), (4),...) removed as suggested by Giné 
(1974). For related and stronger inequalities under weaker conditions, such as 
unbounded variables, see also Bernstein (1924, 1927), Hoeffding (1963), and 
Uspensky (1937, p. 205). 
Hoeffding (1963, Theorem 2) implies Proposition 1.12. Earlier, Chernoff
(1952, (5.11)) had proved (1.5) and (1.6). Okamoto (1958, Lemma 2(b’)) redis- 
covered (1.6). Inequality (1.7) appeared in Dudley (1978, Lemma 2.7) and 
Proposition 1.16 in Dudley (1982, Lemma 3.3). On Ottaviani’s inequality, The- 
orem 1.19, for real-valued functions see (9.7.2) and the notes to Section 9.7 in 
RAP. The P. Lévy inequality Theorem 1.20 is given for Banach-valued random 
variables in Kahane (1985, Section 2.3). For the case of real-valued random 
variables, it was known much earlier, see the notes to Section 12.3 in RAP. 


Notes to Section 1.4. The proof in this section for Theorem 1.8, due to Bretag- 
nolle and Massart (1989), is an expanded version of their proof. It was included 
in the lecture notes of a MaPhySto course given in Aarhus, Denmark, in August 
1999, and has been revised slightly for inclusion here. 

Komlós, Major, and Tusnády (1975) formulated a construction giving a joint distribution of $\alpha_n$ and $Y_n$, and this construction has been accepted by later workers. But Komlós, Major, and Tusnády gave hardly any proof for (1.1). Csörgő and Révész (1981) sketched a method of proof of (1.1) based on lemmas of G. Tusnády, Lemmas 1.21 and 1.23. The implication from Lemma 1.23 to 1.21 is not difficult, but Csörgő and Révész did not include a proof of Lemma 1.23. Bretagnolle and Massart (1989) gave a proof of the lemmas and of the inequality (1.1) with specific constants, Theorem 1.8. Bretagnolle and Massart’s proof was rather compressed and some readers have had difficulty following it. Csörgő and Horváth (1993, pp. 116-139) expanded the proof while making it more elementary and gave a proof of Lemma 1.23 for $n \ge n_0$ where $n_0$ is at least 100. Section 1.4 gives a detailed and in some minor details corrected version of the original Bretagnolle and Massart proof of the lemmas for all n, overlapping in part with the Csörgő and Horváth proof, and then it proves (1.1) for some constants, as given by Bretagnolle and Massart and largely following their proof.

Mason and van Zwet (1987) gave another proof of the inequality (1.1) and
an extended form of it for subintervals $0 \le t \le d/n$ with $1 \le d \le n$ and $\log n$
replaced by $\log d$, without Tusnády’s inequalities and without specifying the
constants c, K, A. Some parts of the proof sketched by Mason and van Zwet 
are given in more detail by Mason (1998). 

Bennett (1962) proved Lemma 1.28 with the inequality (1.60); see also 
Shorack and Wellner 1986, p. 440, (3). Bennett showed also that the Chernoff 
inequality for p < 1/2 (1.6) follows from (1.60). 

Vorob’ev (1962) proved Theorem 1.31 for finite sets. Then Berkes and 
Philipp (1977, Lemma A1) proved it for separable Banach spaces. Their proof 
carries over to the present case. Vorob’ev (1962) treated more complicated 
families of joint distributions on finite sets, as did Shortt (1984) for more 
general measurable spaces. 




In Theorem 1.31, the assumption that X, Y, and Z are Polish can be weak- 
ened: they could instead be any Borel sets in Polish spaces (RAP, Section 13.1). 
Still more generally, since the proof of Theorem 10.2.2 in RAP depends just on 
tightness, it is enough to assume that X, Y, and Z are universally measurable 
subsets of their completions, in other words, measurable for the completion of 
any probability measure on the Borel sets (RAP, Section 11.5). Shortt (1983) 
treats universally measurable spaces and considers just what hypotheses on 
X, Y, and Z are necessary. 


Notes to Section 1.5. This section is based on the paper of Massart (1990), with 
some details inserted or revised. Massart states Proposition 1.35 without proof, 
referring to Smirnov (1944), which I have not seen, for a proof. Birnbaum and 
Tingey (1951) state and prove one Theorem, given in their formula (3.0), which 
is very close to Proposition 1.35. One difference is that they treat Pr(D, < à) = 
1 — Pr(D, > A) and expand the latter as a sum. Birnbaum and Tingey cite a 
paper by Smirnov from 1939, not 1944. Their sketched proof is very parallel 
to but different from the detailed one given above. 
Gaussian Processes; Sample Continuity


2.1 General Empirical and Gaussian Processes 


In Chapter 1 we needed essentially just two Gaussian processes, the Brownian motion $\{x_t\}_{t \ge 0}$ and the Brownian bridge $\{y_t\}_{0 \le t \le 1}$, to get a limit for a classical empirical process $\{\sqrt{n}(F_n - F)(x)\}_{x \in \mathbb{R}}$ as $\{y_{F(x)}\}_{x \in \mathbb{R}}$. It will be seen how that limit can be extended.

Let $(S, \mathcal{S}, P)$ be a probability space, where $S$ might often be a Euclidean space $\mathbb{R}^d$ but now with $d \ge 2$. Let $X_1, X_2, \ldots$, be independent, identically distributed variables with values in $S$ and distribution $P$. Let $P_n$ be the empirical probability measure $P_n := \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ on $S$, recalling that $\delta_x(A) = 1_A(x) = 1$ for $x \in A$ and 0 otherwise. For any given set $A \in \mathcal{S}$, $\sqrt{n}(P_n - P)(A)$ will converge in distribution by the original central limit theorem (for binomial probabilities, de Moivre 1733) to a variable $G_P(A)$ with distribution $N(0, P(A)(1 - P(A)))$. If $B$ is another set in $\mathcal{S}$, then the random vector $\sqrt{n}((P_n - P)(A), (P_n - P)(B))$ will converge in distribution as $n \to \infty$ to $(G_P(A), G_P(B))$, by the multivariate central limit theorem (e.g. RAP, Theorem 9.5.6), having a normal bivariate distribution with mean 0, and the covariance


$$E(G_P(A)G_P(B)) = P(A \cap B) - P(A)P(B) = \mathrm{Cov}_P(1_A, 1_B). \qquad (2.1)$$


Similarly, for any finite collection $\{A_j\}_{j=1}^k$ of sets in $\mathcal{S}$, $\sqrt{n}\{(P_n - P)(A_j)\}_{j=1}^k$ will converge in distribution to $\{G_P(A_j)\}_{j=1}^k$ which has a k-variate normal distribution with mean 0 and covariances given by (2.1) for $A = A_i$, $B = A_j$.
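The covariance formula (2.1) is easy to check on a small finite probability space. The sketch below uses a hypothetical three-point space with exact rational probabilities; the sets A and B and all function names are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical three-point probability space.
P = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

def prob(A):
    """P(A) for a subset A of the sample space."""
    return sum(P[x] for x in A)

def expect(f):
    """E_P f for a real function f on the sample space."""
    return sum(P[x] * f(x) for x in P)

def indicator(A):
    """Indicator function 1_A."""
    return lambda x: 1 if x in A else 0

def cov(f, g):
    """Cov_P(f, g) = E_P(fg) - E_P f E_P g."""
    return expect(lambda x: f(x) * g(x)) - expect(f) * expect(g)

A, B = {"a", "b"}, {"b", "c"}
lhs = prob(A & B) - prob(A) * prob(B)       # P(A n B) - P(A)P(B)
rhs = cov(indicator(A), indicator(B))       # Cov_P(1_A, 1_B)
print(lhs, rhs, cov(indicator(A), indicator(A)))
```

Taking B = A recovers the variance P(A)(1 − P(A)) of the limiting normal variable for a single set.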

For any function $f$ and probability measure $Q$ on $S$ such that $E_Q f = \int f\,dQ$ is defined and $\int f^2\,dQ < \infty$, let $Qf := E_Q(f)$. This will always hold for $Q = P_n$. If it also holds for $Q = P$, then $\sqrt{n}(P_n - P)(f)$ will converge in distribution to a random variable $G_P(f)$ with distribution $N(0, \mathrm{Var}_P(f))$ where $\mathrm{Var}_P(f)$ is the variance of $f$ for $P$, by the usual central limit theorem. If the hypotheses hold for functions $f_1, \ldots, f_k$ with respect to $P$, then the random vector $\sqrt{n}\{(P_n - P)(f_j)\}_{j=1}^k$ will converge in distribution as $n \to \infty$ to $\{G_P(f_j)\}_{j=1}^k$ which have a k-variate normal distribution with mean 0 and covariances given by


$$E(G_P(f)G_P(g)) = (f, g)_{0,P} = \mathrm{Cov}_P(f, g) = E_P(fg) - E_P f\,E_P g. \qquad (2.2)$$


The main question this book treats is to find conditions under which the convergence in distribution extends to infinite classes $\mathcal{C}$ of sets or $\mathcal{F}$ of functions, with respect to uniform convergence over such a class. In Chapter 1 this was done in $\mathbb{R}$ for the family of half-lines $\mathcal{C} = \{(-\infty, x]: x \in \mathbb{R}\}$. This chapter will consider Gaussian processes with general index sets $T$, such as $G_P$ when $T$ is a class $\mathcal{C}$ of sets or a class $\mathcal{F}$ of functions. For a central limit theorem to hold uniformly over a class $\mathcal{C}$ or $\mathcal{F}$, we will need the limit process $G_P$ to be reasonably well behaved on the class, as, for example, the Brownian bridge sample paths $t \mapsto y_t(\omega)$ can be and are taken to be continuous in $t$ for (almost) all $\omega$.


2.2 Some Definitions 


A real-valued stochastic process consists of a set $T$, a probability space $(\Omega, \mathcal{A}, P)$, and a map $(t, \omega) \mapsto X_t(\omega)$ from $T \times \Omega$ into $\mathbb{R}$, such that for each $t \in T$, $X_t(\cdot)$ is measurable from $\Omega$ into $\mathbb{R}$. The process is called Gaussian iff for every finite subset $F$ of $T$, the law $\mathcal{L}(\{X_t\}_{t \in F})$ is a normal distribution on $\mathbb{R}^F$, where the law $\mathcal{L}(X)$ of a random variable or vector $X$ means the probability measure which is its distribution.

For any set $S$ let $\mathbb{R}^S$ be the set of all real-valued functions on $S$. Whenever $A \subset B$, there is a natural mapping (projection) $\pi_{B,A}$ of $\mathbb{R}^B$ onto $\mathbb{R}^A$, namely, restriction of functions to $A$. The Kolmogorov existence theorem for stochastic processes, applied to real-valued processes, is as follows: let $T$ be any set, and suppose for any finite subset $F \subset T$, we are given a probability measure $P_F$ on $\mathbb{R}^F$, a finite-dimensional real vector space with its usual Borel σ-algebra. Suppose that the $P_F$ are consistent in the sense that for each nonempty set $E \subset F$ finite, the image measure of $P_F$ under $\pi_{F,E}$ is $P_E$. Then by Kolmogorov’s theorem (e.g. RAP, Theorem 12.1.2) there exists a probability measure $P_T$ on $\mathbb{R}^T$, defined on the smallest σ-algebra making all $\pi_{T,F}$ measurable for $F$ finite, such that the image measure of $P_T$ by $\pi_{T,F}$ is $P_F$ for all finite $F$. This $P_T$ gives a stochastic process where $\Omega = \mathbb{R}^T$, and for each $t \in T$ and $\omega \in \Omega$ we take $X_t(\omega) := \omega(t)$.

Kolmogorov’s theorem may not be surprising if one realizes that $\mathbb{R}^T$ is the class of all real-valued functions on $T$, which have a real value at each $t$, but may vary wildly as $t$ varies.

Kolmogorov’s theorem specializes to Gaussian processes as follows. If a finite set $F$ has cardinality $d \ge 1$, let $\mu \in \mathbb{R}^F$ be any vector, and let $C: F \times F \to \mathbb{R}$, regarded as a $d \times d$ matrix, be symmetric and nonnegative definite. Then we know that there exists a uniquely determined normal (Gaussian) probability measure $N(\mu, C)$ with mean $\mu$ and covariance matrix $C$. Let $T$ again be any set and suppose for each nonempty finite $F \subset T$ we have $\mu_F \in \mathbb{R}^F$ and $C_F: F \times F \to \mathbb{R}$, which is symmetric and nonnegative definite. Suppose that these are mutually consistent in the sense that whenever $E \subset F$, the restriction of $\mu_F$ to $E$ is $\mu_E$, and the restriction of $C_F$ to $E \times E$ is $C_E$. Then the probability measures $P_F = N(\mu_F, C_F)$ are consistent, so Kolmogorov’s theorem applies, and the resulting process is Gaussian with the given means and covariances.

Let $X$ be a real vector space. Recall that a seminorm is a function $\|\cdot\|$ from $X$ into the nonnegative real numbers such that $\|x + y\| \le \|x\| + \|y\|$ for all $x$ and $y$ in $X$ and $\|cx\| = |c|\,\|x\|$ for all real $c$ and $x \in X$. The seminorm $\|\cdot\|$ is called a norm if $\|x\| = 0$ only for $x = 0$ in $X$, and then $(X, \|\cdot\|)$ is called a normed linear space. A norm defines a metric by $d(x, y) := \|x - y\|$. A normed linear space complete for this metric is called a Banach space. As with any metric space, it is called separable if it has a countable dense subset. A probability distribution $P$ defined on a separable Banach space will be assumed to be defined on the Borel σ-algebra generated by the open sets, unless another σ-algebra is specified. Then $P$ will be called a law.

Some Banach spaces with especially pleasant properties are Hilbert spaces, defined as follows. For a real vector space $X$, an inner product is a function $(\cdot,\cdot)$ from $X \times X$ into $\mathbb{R}$ which is symmetric, i.e. $(x, y) = (y, x)$ for all $x$ and $y$ in $X$; bilinear, meaning that for each fixed $x$, $(x, \cdot)$ is linear from $X$ into $\mathbb{R}$, thus for each $y$, $(\cdot, y)$ is linear from $X$ into $\mathbb{R}$; and positive definite, meaning that $(x, x) > 0$ for all $x \ne 0$ in $X$.


Example: For X any finite-dimensional Euclidean space, the usual dot product 
of vectors is an inner product. 


Any inner product $(\cdot,\cdot)$ defines a norm by $\|x\| = (x, x)^{1/2}$: one can easily see that for any constant $c$,


$$\|cx\| = (cx, cx)^{1/2} = \left[c^2(x, x)\right]^{1/2} = |c|\,\|x\|.$$


The subadditivity $\|x + y\| \le \|x\| + \|y\|$ follows from the Cauchy–Schwarz inequality for inner products, $|(x, y)| \le \|x\|\,\|y\|$, which is proved in the same way as the usual Cauchy–Schwarz inequality, by expanding $(x + ty, x + ty) \ge 0$ for all real $t$ as a quadratic in $t$.

If a norm is defined by an inner product, then the inner product is uniquely determined, as $(x, y) = (\|x + y\|^2 - \|x - y\|^2)/4$. Those norms definable from inner products are characterized by the fact called the parallelogram law: $\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2$.
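Both facts, that the norm determines the inner product and that the parallelogram law characterizes inner-product norms, can be illustrated with exact arithmetic. The sketch below checks the two identities for the dot product on $\mathbb{R}^3$ and shows that the sup norm on $\mathbb{R}^2$ violates the parallelogram law; the vectors are arbitrary illustrative choices.

```python
from fractions import Fraction

def inner(x, y):
    """Usual dot product."""
    return sum(a * b for a, b in zip(x, y))

def add(x, y):
    return tuple(a + b for a, b in zip(x, y))

def sub(x, y):
    return tuple(a - b for a, b in zip(x, y))

def norm_sq(x):
    """Squared norm defined by the inner product."""
    return inner(x, x)

x = (Fraction(1), Fraction(-2), Fraction(3))
y = (Fraction(4), Fraction(0), Fraction(-1, 2))

# Recovering the inner product from the norm.
polar = (norm_sq(add(x, y)) - norm_sq(sub(x, y))) / 4
# Parallelogram law: both sides.
para_lhs = norm_sq(add(x, y)) + norm_sq(sub(x, y))
para_rhs = 2 * norm_sq(x) + 2 * norm_sq(y)

def sup_norm(v):
    return max(abs(a) for a in v)

# The sup norm on R^2 is not given by any inner product.
e1, e2 = (1, 0), (0, 1)
sup_lhs = sup_norm(add(e1, e2)) ** 2 + sup_norm(sub(e1, e2)) ** 2
sup_rhs = 2 * sup_norm(e1) ** 2 + 2 * sup_norm(e2) ** 2

print(polar, inner(x, y), para_lhs == para_rhs, sup_lhs, sup_rhs)
```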
If $X$ is complete for the norm $\|x\|$ defined by an inner product, then it is called a Hilbert space. Among the most-encountered Hilbert spaces are the $L^2$ spaces defined as follows. Let $(S, \mathcal{S}, \mu)$ be a σ-finite measure space. Let $\mathcal{L}^2(S, \mathcal{S}, \mu)$ be the set of all measurable real functions $f$ on $S$ such that $\int_S f^2\,d\mu < +\infty$. Let $L^2(S, \mathcal{S}, \mu)$ be the set of equivalence classes of functions in $\mathcal{L}^2$ for equality $\mu$-almost everywhere. On $L^2$ we have the inner product $(f, g) = \int_S f(x)g(x)\,d\mu(x)$, which does not depend on the choice of $f$ and $g$ from their equivalence classes. A proof that this is in fact an inner product is easy.

Let $(X, \|\cdot\|)$ be a separable Banach space. A law $P$ on $X$ will be called Gaussian or normal iff for every continuous linear form $f \in X'$, $P \circ f^{-1}$ is a normal law on $\mathbb{R}$. Recall that a law on a finite-dimensional real vector space is normal if and only if every real linear form is normally distributed (RAP, Theorem 9.5.13).


2.2.1 The Isonormal Process 


Let $H$ be a Hilbert space with inner product $(\cdot,\cdot)$. There exists a Gaussian process $L$ indexed by $H$ with mean $EL(f) = 0$ for all $f \in H$ and covariance $E(L(f)L(g)) = (f, g)$, where the covariances are nonnegative definite by the assumptions on an inner product. This process exists by the general existence theorem for Gaussian processes mentioned earlier in Section 2.2. It is called the isonormal process on $H$. For any $x \in H$, $L(x)$ is a Gaussian random variable with distribution $N(0, \|x\|^2)$, and for any $x_1, \ldots, x_n \in H$, $(L(x_1), \ldots, L(x_n))$ have a jointly normal distribution with covariance given by the inner products $(x_i, x_j)$.

The isonormal process has the following linearity property (also in RAP, 
Theorem 12.1.4): 


Theorem 2.1 For an isonormal process L on a Hilbert space H, and any 
x, y € H and real constant c, with probability 1, 


L(cx + y) = cL(x) + L(y). (2.3) 


Proof. Expanding $E((L(cx + y) - cL(x) - L(y))^2)$ and using the given covariances for $L$, we find that the given expectation is 0, and the conclusion follows.
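The expansion in the proof can be made concrete. For vectors in a finite-dimensional space with the dot product, the second moment of $L(cx + y) - cL(x) - L(y)$ is a quadratic form in the inner products of $cx + y$, $x$, and $y$ with coefficients $(1, -c, -1)$. The sketch below evaluates that quadratic form exactly; the vectors and the constant are arbitrary illustrative choices.

```python
from fractions import Fraction

def inner(u, v):
    """Inner product on R^d, playing the role of the covariance E L(u)L(v)."""
    return sum(a * b for a, b in zip(u, v))

def second_moment_of_defect(x, y, c):
    """E[(L(cx + y) - c L(x) - L(y))^2], computed only from E L(f)L(g) = (f, g)."""
    z = tuple(c * a + b for a, b in zip(x, y))
    vecs = [z, x, y]
    coef = [1, -c, -1]
    return sum(coef[i] * coef[j] * inner(vecs[i], vecs[j])
               for i in range(3) for j in range(3))

x = (Fraction(2), Fraction(-1), Fraction(1, 3))
y = (Fraction(0), Fraction(5), Fraction(-7, 2))
print(second_moment_of_defect(x, y, Fraction(3, 4)))
```

A random variable with second moment 0 vanishes almost surely, which gives (2.3).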


Remark. The set of probability 0 on which (2.3) fails can depend on c, x, and 
y. The equation can be taken to hold with probability 1 for all real c and all x 
and y in a suitable subset, in Theorem 3.2 below. 


Two stochastic processes $\{X_t, t \in T\}$ and $\{Y_t, t \in T\}$ defined on the same index set, but possibly on different probability spaces, will be said to have the same laws, or one will be said to be a version of the other, if for each finite subset $F$ of $T$, $\{X_t, t \in F\}$ and $\{Y_t, t \in F\}$ have the same law. If $X_t$ and $Y_t$ are defined on the same probability space, then one is said to be a modification of the other if for each $t \in T$, we have $P(X_t = Y_t) = 1$. On the relationship of versions and modifications, especially for the isonormal process restricted to a set, see Appendix I.

Any Gaussian process with mean 0 can be factored through L in the follow- 
ing sense: 


Theorem 2.2 Let $\{X_t\}_{t \in T}$ be any Gaussian process with mean 0 defined on a probability space $(\Omega, \mathcal{A}, Q)$. Let $H$ be the Hilbert space $L^2(\Omega, \mathcal{A}, Q)$ and $L$ the isonormal process on $H$. For each $t \in T$ let $Y_t := L(X_t(\cdot))$. Then the process $\{Y_t\}_{t \in T}$ is a version of $\{X_t\}_{t \in T}$.


Proof. Since $L$ is a Gaussian process with mean 0, so is $\{Y_t\}_{t \in T}$. For any $s, t \in T$, the covariance $E(Y_sY_t)$ equals the inner product of $X_s(\cdot)$ and $X_t(\cdot)$ in $L^2(\Omega, \mathcal{A}, Q)$, which is $E(X_sX_t)$, so the covariances of the two processes are the same, and being Gaussian, they are versions of each other.


One can get the Gaussian process $G_P$ with mean 0 and covariance (2.2), which is the limit of empirical processes $\sqrt{n}(P_n - P)$, from an isonormal process as follows. Given $(S, \mathcal{S}, P)$, let $H$ be the Hilbert space $L^2(S, \mathcal{S}, P)$ and let $W_P$ be the isonormal process on $H$. Set


$$G_P(f) := W_P(f - E_P f),$$


which equals $W_P(f) - E_P f\,W_P(1)$ almost surely for any given $f$. Clearly, this gives a Gaussian process with mean 0 and covariance (2.2). From this representation, one can see by Theorem 2.1 that for any $f$ and $g$ in $\mathcal{L}^2$ and real $c$, with probability 1


$$G_P(cf + g) = cG_P(f) + G_P(g). \qquad (2.4)$$


Suppose a Gaussian process $\{X_t\}_{t\in T}$ with mean 0 is defined on a space $T$,
such as a Euclidean space or a subset of one, which already has a usual metric
$e$. Many authors on Gaussian processes, instead of factorizing through $L$ as in
Theorem 2.2, adopt the idea in an alternate form by keeping $T$ but using on it
the "natural metric" (more precisely a pseudometric) defined by the process,
namely, for $s$ and $t$ in $T$,

$$e_X(s, t) := (E((X_s - X_t)^2))^{1/2}.$$

A pseudometric on $T$ is a function $\rho$ on $T \times T$ which is a metric except possibly
that $\rho(x, y) = 0$ for some $x \ne y$. Then $t \mapsto X_t$ is an isometry (it preserves distances)
from $T$ with $e_X$ into the Hilbert space $H = L^2(\Omega, \mathcal{A}, Q)$ with its usual metric.
For two examples, for the Brownian motion process $\{x_t\}_{t\ge 0}$, the natural metric
would be $|s - t|^{1/2}$, the square root of the usual metric. For the Brownian
bridge $\{y_t\}_{0\le t\le 1}$ the natural "metric" is $(|t - s| - (t - s)^2)^{1/2}$, a pseudometric
since $e_y(0, 1) = 0$. For other processes the natural metric may have no simple
relation to the usual metric.
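
As a quick check of the two examples, the natural metric can be computed directly from a covariance function. This is a small sketch, assuming the standard covariances $\min(s, t)$ for Brownian motion and $\min(s, t) - st$ for the Brownian bridge, which are not restated in this section; the function names are illustrative.

```python
import math

def bm_cov(s, t):
    """Covariance of Brownian motion (assumed standard form): min(s, t)."""
    return min(s, t)

def bridge_cov(s, t):
    """Covariance of the Brownian bridge on [0, 1] (assumed): min(s, t) - s*t."""
    return min(s, t) - s * t

def natural_metric(cov, s, t):
    """e_X(s, t) = (E (X_s - X_t)^2)^{1/2}, computed from the covariance."""
    var = cov(s, s) - 2.0 * cov(s, t) + cov(t, t)
    return math.sqrt(max(var, 0.0))  # guard against tiny negative rounding

def bridge_formula(s, t):
    """Closed form stated in the text for the Brownian bridge."""
    return math.sqrt(max(abs(t - s) - (t - s) ** 2, 0.0))

grid = [k / 20.0 for k in range(21)]
```

The endpoints 0 and 1 are distinct points at natural distance 0, which is exactly why only a pseudometric is obtained for the bridge.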

For the isonormal process $L$, often the Hilbert space $H$ will be taken to be
separable and infinite-dimensional. In the finite-dimensional case we have a
representation as follows. For any real Hilbert space $H$, the map $i_H$ from $H$
into its own dual Banach space, defined by $i_H(x)(y) := (y, x)$, is linear,
one-to-one, and onto. The dual space is itself a Hilbert space, and $i_H$ preserves the
Hilbert structure. If $H$ is finite-dimensional, then we have the standard normal
probability measure $N(0, I)$ defined on it, where $I$ is the identity matrix or
operator. The next fact is immediate:


Proposition 2.3 For any finite-dimensional Hilbert space $H$, the mapping $i_H$,
with the probability measure $N(0, I)$ on $H$, is a version of the isonormal
process $L$.


The last proposition does not extend to $H$ infinite-dimensional because a
probability measure "$N(0, I)$" does not exist on $H$: let $\{e_j\}$ be an infinite
orthonormal set. Then $Z_j := L(e_j)$ are i.i.d. $N(0, 1)$ and $\sum_j Z_j^2 = +\infty$
almost surely.

For any subset $A \subset H$, $L(A)^*$ will be defined as $\operatorname{ess.sup}_{x\in A} L(x)$, the
smallest random variable $Y$ such that $Y \ge L(x)$ a.s. for all $x \in A$, where $Y$ is
determined up to a.s. equality. Here "ess.sup" stands for "essential supremum." It
will be shown in the next lemma that $L(A)^*$ is well-defined for $A$ separable.
It will be seen later (proof of Theorem 2.19) that for $A$ nonseparable it is also
well-defined and equal to $+\infty$ a.s. Similarly let $|L(A)|^* := \operatorname{ess.sup}_{x\in A}|L(x)|$.


Lemma 2.4 For any separable subset $A \subset H$, $L(A)^*$ and $|L(A)|^*$ are
well-defined up to almost sure equality.


Proof. If $B$ is countable, then $L(B)^*$ exists and equals $\sup_{x\in B} L(x)$, clearly.
Let $B$ be any countable dense subset of $A$. Let $Y := L(B)^*$. Then $Y$ is
measurable. If $U$ is any random variable such that $U \ge L(x)$ a.s. for all $x \in A$,
then clearly $U \ge Y$ a.s. For each $x \in A$, there is a sequence $y_n \in B$ such that
$\|y_n - x\|^2 < 1/n^2$ for all $n = 1, 2, \ldots$, and then by Chebyshev's inequality
and the Borel-Cantelli lemma, $L(y_n) \to L(x)$ a.s. Thus $L(x) \le Y$ a.s., and
as seen above $Y$ is the smallest random variable with this property, so $L(A)^*$ is
well-defined and equals $Y$ a.s. The proof with absolute values is the same.


Definitions A set $C$ in $H$ is called a GB-set iff $|L(C)|^* < \infty$ a.s. Also, $C$ will
be called a GC-set iff it is totally bounded and the restriction of $L$ to $C$ can be
chosen so that each of its sample functions $x \mapsto L(x)(\omega)$, $x \in C$, is uniformly
continuous on $C$.


2.3 Bounds for Gaussian Vectors


First, some bounds for one-dimensional Gaussian variables will be given. Let
$\Phi$ be the standard normal distribution function and $\phi$ its density function, so
$\phi(x) = (2\pi)^{-1/2}\exp(-x^2/2)$ for all real $x$, and $\Phi(x) = \int_{-\infty}^{x} \phi(u)\,du$.


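Proposition 2.5 Let $X$ be a real-valued random variable with a normal
distribution $N(0, \sigma^2)$. Then

(a) for any $M > 0$, $\Pr(|X| \ge M) \le \exp(-M^2/(2\sigma^2))$;
(b) if $M/\sigma \ge 1$, then

$$\frac{\sigma}{M}\,\phi\!\left(\frac{M}{\sigma}\right) \le \Pr(|X| \ge M) \le \frac{2\sigma}{M}\,\phi\!\left(\frac{M}{\sigma}\right).$$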


Proof. Replacing $X$ by $X/\sigma$, we can assume $\sigma = 1$. For (a) we want to prove
$2\Phi(-c) \le \exp(-c^2/2)$ for any $c \ge 0$. This holds for $c = 0$ and follows by
differentiating both sides for $0 < c \le (2/\pi)^{1/2}$. For larger $c$ it follows from
$\Phi(-c) \le \phi(c)/c$ (RAP, Lemma 12.1.6(a)), as does the right side of (b). For
the left side of (b), note that $\phi$ is a convex function for $x \ge 1$ since there
$\phi''(x) = (x^2 - 1)\phi(x) \ge 0$. Thus, the region between the graph of $\phi$ and the
$x$ axis for $x \ge c$ includes a right triangle with right-angle vertex at $(c, 0)$, a vertex
at $(c, \phi(c))$, and whose hypotenuse is along the tangent line to the graph of
$\phi$ at $c$. Since the tangent has slope $-c\phi(c)$, the base has length $1/c$, so the
triangle has area $\phi(c)/(2c)$, which finishes the proof.
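
The bounds of Proposition 2.5 are easy to check numerically. The sketch below is an illustration, not part of the proof: it uses the standard-library identity $\Pr(|X| \ge M) = \operatorname{erfc}(M/(\sigma\sqrt{2}))$ for $X \sim N(0, \sigma^2)$, and the helper names are hypothetical.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def two_sided_tail(M, sigma):
    """Pr(|X| >= M) for X ~ N(0, sigma^2), via the complementary error function."""
    return math.erfc(M / (sigma * math.sqrt(2.0)))

def bound_a(M, sigma):
    """Right side of Proposition 2.5(a)."""
    return math.exp(-M * M / (2.0 * sigma * sigma))

def bounds_b(M, sigma):
    """(lower, upper) bounds of Proposition 2.5(b); meaningful when M/sigma >= 1."""
    c = M / sigma
    return phi(c) / c, 2.0 * phi(c) / c

sigma = 1.5
ratios_a = [0.1 * k for k in range(1, 61)]          # M/sigma in (0, 6]
ratios_b = [1.0 + 0.1 * k for k in range(0, 51)]    # M/sigma in [1, 6]
```

For large $M/\sigma$ the upper bound in (b) is much sharper than (a), while (a) has the advantage of holding for every $M > 0$.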


This section will prove an extension of inequality (a) to infinite-dimensional
Gaussian variables such as those taking values in separable Banach spaces. It
will be said that a law $P$ on a separable Banach space $(X, \|\cdot\|)$ has mean 0
if $\int \|x\|\,dP(x) < \infty$ and $\int f(x)\,dP(x) = 0$ for each $f \in X'$. Recall the dual
norm $\|f\|' := \sup\{|f(x)| : \|x\| \le 1\}$. Here is one of the main results.


Theorem 2.6 (Landau-Shepp-Marcus-Fernique) Let $P$ be a normal law
with mean 0 on a separable Banach space $X$. For $f \in X'$ let $\sigma^2(f) := \int f^2\,dP$.
Then $\tau^2 := \sup\{\sigma^2(f) : \|f\|' \le 1\} < \infty$ and

$$\int \exp(\alpha\|x\|^2)\,dP(x) < \infty \quad \text{for any } \alpha < 1/(2\tau^2).$$


By Proposition 2.5(a), the theorem holds in the one-dimensional case, and
by the left side of part (b), the condition $\alpha < 1/(2\tau^2)$ is best possible. Before
the theorem is proved in general, some other facts will be brought in.
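
In one dimension the threshold can be seen explicitly: for $X \sim N(0, \sigma^2)$ a direct Gaussian integral gives $E\exp(\alpha X^2) = (1 - 2\alpha\sigma^2)^{-1/2}$ when $\alpha < 1/(2\sigma^2)$, and $+\infty$ otherwise. The following sketch compares that closed form with a plain trapezoid-rule integral; the truncation width and step count are arbitrary illustrative choices.

```python
import math

def exp_moment_closed(alpha, sigma):
    """E exp(alpha X^2) for X ~ N(0, sigma^2); infinite at or beyond the threshold."""
    gap = 1.0 - 2.0 * alpha * sigma * sigma
    return math.inf if gap <= 0.0 else gap ** -0.5

def exp_moment_numeric(alpha, sigma, half_width=60.0, steps=200_000):
    """Trapezoid-rule approximation of the same expectation on [-half_width, half_width]."""
    rate = 1.0 / (2.0 * sigma * sigma) - alpha  # integrand is exp(-rate * x^2) / norm
    if rate <= 0.0:
        return math.inf
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-rate * x * x)
    return total * h / (sigma * math.sqrt(2.0 * math.pi))

alphas = [0.1, 0.3, 0.45]  # all below the threshold 1/(2 sigma^2) = 0.5 for sigma = 1
```

As $\alpha$ increases to $1/(2\sigma^2)$ the closed form blows up, which is the one-dimensional content of "best possible."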


Definition Let $X$ be a real vector space and $\mathcal{B}$ a $\sigma$-algebra of subsets of $X$.
Then $(X, \mathcal{B})$ is called a measurable vector space if both

(a) Addition is jointly measurable from $X \times X$ to $X$, and
(b) Scalar multiplication is jointly measurable from $\mathbb{R} \times X$ to $X$ (for the
usual Borel $\sigma$-algebra on $\mathbb{R}$).


Example. Let $X$ be a topological vector space, namely, a vector space with a
topology for which (a) and (b) hold with "measurable" replaced by "continuous."
Suppose the topology of $X$ is metrizable and separable. For a Cartesian
product of two separable metric spaces, since their topologies have countable
bases, the Borel $\sigma$-algebra in the product equals the product $\sigma$-algebra of the
Borel $\sigma$-algebras in the two spaces (RAP, Proposition 4.1.7). Thus $X$ with its
Borel $\sigma$-algebra is a measurable vector space.


The notion of normal law cannot be defined for general measurable vector
spaces by way of linear forms, as it was for Banach spaces in the last section,
since there exist measurable vector spaces, such as spaces $L^p[0, 1]$ for $0 < p < 1$,
which have nontrivial normal measures but turn out to have no nontrivial
measurable linear forms (Appendix F). Fernique (1970) proposed the following
ingenious definition:


Definition A probability measure $P$ on a measurable vector space $(X, \mathcal{B})$ will
be called centered Gaussian iff for variables $U$ and $V$ independent with law
$P$ (say, coordinates on the product $X \times X$ for the product law $P \times P$) and
any $\theta$ with $0 \le \theta \le 2\pi$, $U\cos\theta + V\sin\theta$ and $-U\sin\theta + V\cos\theta$ are also
independent with distribution $P$.


If $X = \mathbb{R}$, the transformation of $(U, V) \in \mathbb{R}^2$ in the last definition is a
rotation through an angle $\theta$. Normal laws with mean 0 on finite-dimensional
real vector spaces are centered Gaussian in this sense, as can be seen from
covariances. Conversely, a law on $X = \mathbb{R}$ satisfying the above definition of
"centered Gaussian," even for one value of $\theta$ with $\sin(2\theta) \ne 0$, must be normal
according to the "Darmois-Skitovic" theorem; see the notes for this section.
We will not need the full strength of the latter theorem below, but the following
facts will be proved.
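
The remark "as can be seen from covariances" can be spelled out in a few lines: if $U$ and $V$ are independent $N(0, \Sigma)$ vectors, the joint covariance of the rotated pair is again block-diagonal with blocks $\Sigma$, and a jointly normal pair with that covariance is again a pair of independent $N(0, \Sigma)$ vectors. The sketch below checks the covariance identity for an arbitrary illustrative $\Sigma$; the small matrix helpers are written out only to keep the example self-contained.

```python
import math

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def block_diag(S):
    """Covariance of (U, V) for independent U, V with covariance S."""
    d = len(S)
    out = [[0.0] * (2 * d) for _ in range(2 * d)]
    for i in range(d):
        for j in range(d):
            out[i][j] = S[i][j]
            out[d + i][d + j] = S[i][j]
    return out

def rotation(d, theta):
    """Matrix of (U, V) -> (U cos t + V sin t, -U sin t + V cos t) on R^d x R^d."""
    c, s = math.cos(theta), math.sin(theta)
    R = [[0.0] * (2 * d) for _ in range(2 * d)]
    for i in range(d):
        R[i][i], R[i][d + i] = c, s
        R[d + i][i], R[d + i][d + i] = -s, c
    return R

def rotated_cov(S, theta):
    """Joint covariance of the rotated pair: R * blockdiag(S, S) * R^T."""
    R = rotation(len(S), theta)
    return matmul(matmul(R, block_diag(S)), transpose(R))

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

SIGMA = [[2.0, 0.6], [0.6, 1.0]]  # illustrative covariance matrix
```

The off-diagonal blocks vanish because $-\cos\theta\sin\theta\,\Sigma + \sin\theta\cos\theta\,\Sigma = 0$, and the diagonal blocks are $(\cos^2\theta + \sin^2\theta)\Sigma = \Sigma$.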


Proposition 2.7 Let $(X, \mathcal{A})$ and $(Y, \mathcal{B})$ be two measurable vector spaces. Let
$P$ be a centered Gaussian measure on $(X, \mathcal{A})$ and $T$ a measurable linear map
from $X$ into $Y$. Then the image measure $Q := P \circ T^{-1}$ is centered Gaussian
on $(Y, \mathcal{B})$.


Proof. Let $(x, \xi) \in X \times X$ with distribution $P \times P$. Then $(Tx, T\xi)$ has
distribution $Q \times Q$ on $Y \times Y$. Also for each $\theta$, $((\cos\theta)x + (\sin\theta)\xi,\ (-\sin\theta)x + (\cos\theta)\xi)$
has distribution $P \times P$ on $X \times X$. Now

$$((\cos\theta)Tx + (\sin\theta)T\xi,\ (-\sin\theta)Tx + (\cos\theta)T\xi) = (T((\cos\theta)x + (\sin\theta)\xi),\ T((-\sin\theta)x + (\cos\theta)\xi))$$

has distribution $Q \times Q$ on $Y \times Y$, so $Q$ is centered Gaussian.


Proposition 2.8 For any finite dimension $d$, a centered Gaussian law $P$ on $\mathbb{R}^d$
with Borel $\sigma$-algebra is a centered normal law in the usual sense, $N(0, \Sigma)$ for
some nonnegative definite symmetric matrix $\Sigma$.


Proof. First suppose $d = 1$. Then $P \times P$ on $\mathbb{R}^2$ is invariant under all rotations.
A rotation through $\theta = \pi$ shows that $P$ is symmetric, $dP(x) = dP(-x)$. Let
$f$ be the characteristic function of $P$, $f(t) := \int_{-\infty}^{\infty} e^{ixt}\,dP(x)$. Then $f$ is
real-valued, $f(0) = 1$, and $f(t) = f(-t)$. Any point $(t, u) \in \mathbb{R}^2$ can be rotated
to a point on a coordinate axis, so $f(t)f(u) = f((t^2 + u^2)^{1/2})$. Let $h(t) :=
\log f(|t|^{1/2})$ where it is defined and finite, i.e. where $f > 0$, as is true at least
in a neighborhood of 0. Then $h(t + u) = h(t) + h(u)$ for $t, u \ge 0$ and, perhaps,
small enough. Where both sides are defined and finite we have $h(qu) = qh(u)$
first when $q$ is an integer, then when it is rational, and then for general real $q$
by continuity. Since $h$ thus is bounded on finite intervals where it is defined, it
is defined and continuous on the whole line, with $h(t) = ct$ for some constant
$c = h(1)$, so $f(t) = \exp(ct^2)$ for all $t$, and $c \le 0$ since $|f(t)| \le 1$. Thus $P =
N(0, \sigma^2)$ where $\sigma^2 = -2c$ (RAP, Proposition 9.4.2, Theorem 9.5.1).

Now for general $d$, for any linear form $f$ from $\mathbb{R}^d$ into $\mathbb{R}$, $P \circ f^{-1}$ is centered
Gaussian by Proposition 2.7 and thus is a law $N(0, \sigma^2)$ by the $d = 1$ case. It
follows that $P$ is a law $N(0, \Sigma)$ by RAP, Theorem 9.5.13.


Given a normal measure $P = N(0, C)$ on a finite-dimensional space $X$ and a
vector subspace $Y$ of $X$, it follows from the structure of normal measures (RAP,
Theorem 9.5.7) that $P(Y) = 0$ or 1. This fact extends to general measurable
vector spaces:


Theorem 2.9 (0-1 law) Let $(X, \mathcal{B})$ be a measurable vector space and $Y$ a vector
subspace with $Y \in \mathcal{B}$. Then for any centered Gaussian law $P$ on $X$, $P(Y) = 0$
or 1.


Proof. Let $U$ and $V$ be independent in $X$ with law $P$. For $0 \le \theta \le \pi/2$, let
$A(\theta)$ be the event

$$A(\theta) := \{U\cos\theta + V\sin\theta \in Y,\ -U\sin\theta + V\cos\theta \notin Y\}.$$

If $\theta \ne \phi$, with $0 \le \phi \le \pi/2$, then

$$\cos\theta\sin\phi - \sin\theta\cos\phi = \sin(\phi - \theta) \ne 0,$$

so if $y_1 := u\cos\theta + v\sin\theta \in Y$ and $y_2 := u\cos\phi + v\sin\phi \in Y$, then $u$ and
$v$ can be solved for as linear combinations of $y_1, y_2$, so they are in $Y$ and
$-u\sin\theta + v\cos\theta \in Y$, $-u\sin\phi + v\cos\phi \in Y$. So the sets $A(\theta)$ are disjoint
for different values of $\theta \in [0, \pi/2]$. By definition of centered Gaussian, these
sets all have the same probability, which thus must be 0. Taking $\theta = 0$ gives
$0 = \Pr(U \in Y)\Pr(V \notin Y) = P(Y)P(X \setminus Y)$, so $P(Y) = 0$ or 1.


A measurable function $\|\cdot\|$ from a measurable vector space $X$ into $[0, \infty]$
will be called a pseudo-seminorm iff $Y := \{x \in X : \|x\| < \infty\}$ is a vector
subspace of $X$ and $\|\cdot\|$ is a seminorm on $Y$, that is, $\|cx\| = |c|\,\|x\|$ for each real
$c$ and $x \in Y$, and so for all $x \in X$, with $0 \cdot \infty := 0$, and $\|x + y\| \le \|x\| + \|y\|$
for all $x, y \in Y$, and so for all $x, y \in X$.

By the 0-1 law (Theorem 2.9), for any pseudo-seminorm $\|\cdot\|$ and centered
Gaussian $P$ on $X$, $P(\|\cdot\| < \infty) = 0$ or 1. Likewise, $P(\|\cdot\| = 0) = 0$ or 1.

If $S$ is a countable set, then $\mathbb{R}^S$, the set of all real-valued functions on $S$,
with product topology, is a separable metric topological linear space, hence
a measurable vector space. If $P$ is the law of a Gaussian stochastic process
$\{x_t, t \in S\}$ on $\mathbb{R}^S$, with $Ex_t = 0$ for all $t \in S$, then $P$ is centered Gaussian
on $\mathbb{R}^S$. The supremum "norm" $\|\{y_t, t \in S\}\| := \sup_t |y_t|$ is clearly a
pseudo-seminorm on $\mathbb{R}^S$.

Here is a step toward proving Theorem 2.6: 


Lemma 2.10 (Landau-Shepp-Fernique) Let $(X, \mathcal{B})$ be a measurable vector
space, $P$ a centered Gaussian measure, and $\|\cdot\|$ a pseudo-seminorm on $X$
with $P(\|\cdot\| < \infty) > 0$. Then for some $\varepsilon > 0$,

$$\int \exp(\alpha\|x\|^2)\,dP(x) < \infty \quad \text{for } 0 < \alpha < \varepsilon.$$


Proof. As noted above, $P(\|\cdot\| < \infty) = 1$. Let $U$ and $V$ be independent with
distribution $P$ in $X$. The definition of centered Gaussian for $\theta = -\pi/4$ yields,
for any real $s$ and $t$,

$$P(\|\cdot\| \le s)\,P(\|\cdot\| > t) = \Pr\{\|(U - V)/2^{1/2}\| \le s,\ \|(U + V)/2^{1/2}\| > t\}. \tag{2.5}$$

Note that

$$2^{1/2}\min(\|U\|, \|V\|) \ge \|(U + V)/2^{1/2}\| - \|(U - V)/2^{1/2}\|,$$

where the event that the right side is undefined, equaling $\infty - \infty$, has zero
probability and so can be neglected. Thus on the event on the right in (2.5), we
have $\|U\| > (t - s)/2^{1/2}$ and $\|V\| > (t - s)/2^{1/2}$. So

$$P(\|\cdot\| \le s)\,P(\|\cdot\| > t) \le P\big(\|\cdot\| > (t - s)/2^{1/2}\big)^2. \tag{2.6}$$

Choose $s$ with $0 < s < \infty$ large enough so that $q := P(\|\cdot\| \le s) > 1/2$.
Define a sequence $t_n$ recursively by $t_0 := s$, $t_{n+1} := s + 2^{1/2}t_n$, $n = 0, 1, \ldots$.
Then we have by induction

$$t_n = (2^{1/2} + 1)(2^{(n+1)/2} - 1)\,s.$$

So $t_n$ increases up to $+\infty$ with $n$. Let $x_n := P(\|\cdot\| > t_n)/q$, so $x_{n+1} \le x_n^2$
by (2.6) with $t = t_{n+1}$. By induction, we then have

$$P(\|\cdot\| > t_n) \le q\,((1 - q)/q)^{2^n}.$$

It follows that

$$E\exp(\alpha\|\cdot\|^2) \le q\,e^{\alpha s^2} + \sum_{n=0}^{\infty} P(t_n < \|\cdot\| \le t_{n+1})\exp(\alpha t_{n+1}^2),$$

where the $n$th term of the latter sum is bounded above by

$$q\,(q^{-1} - 1)^{2^n}\exp\big(\alpha(2^{1/2} + 1)^2(2^{(n+2)/2} - 1)^2 s^2\big).$$

The sum will be finite if

$$\sum_{n=0}^{\infty}\exp\Big\{2^n\Big[\log\frac{1 - q}{q} + 4(2^{1/2} + 1)^2\alpha s^2\Big]\Big\} < \infty,$$

which holds for $\alpha < \big(\log\frac{q}{1 - q}\big)/(24s^2)$, proving the Lemma.
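
The bookkeeping in this proof (the closed form for $t_n$, the estimate $(2^{(n+2)/2} - 1)^2 \le 4 \cdot 2^n$, and the constant 24) can be checked mechanically. The sketch below does so for illustrative values of $q$ and $s$; nothing here is specific to a particular Gaussian measure, and the function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def t_recursive(n, s):
    """t_0 = s, t_{k+1} = s + sqrt(2) * t_k, iterated n times."""
    t = s
    for _ in range(n):
        t = s + SQRT2 * t
    return t

def t_closed(n, s):
    """Closed form (sqrt(2) + 1) * (2^{(n+1)/2} - 1) * s."""
    return (SQRT2 + 1.0) * (2.0 ** ((n + 1) / 2.0) - 1.0) * s

def bracket(q, alpha, s):
    """log((1 - q)/q) + 4 (sqrt(2) + 1)^2 alpha s^2, the factor multiplying 2^n."""
    return math.log((1.0 - q) / q) + 4.0 * (SQRT2 + 1.0) ** 2 * alpha * s * s

def alpha_threshold(q, s):
    """The sufficient bound log(q/(1 - q)) / (24 s^2) from the end of the proof."""
    return math.log(q / (1.0 - q)) / (24.0 * s * s)

q, s = 0.8, 1.7                      # illustrative values with q > 1/2
alpha = 0.99 * alpha_threshold(q, s)
```

Since $4(2^{1/2} + 1)^2 \approx 23.3 < 24$, any $\alpha$ below the threshold makes the bracket negative, so the terms $\exp(2^n \cdot \text{bracket})$ decay doubly exponentially.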


Theorem 2.6 will be a corollary of the following fact: 


Theorem 2.11 Let $(X, \mathcal{B})$ be a measurable vector space and $P$ a centered
Gaussian law on $X$. Let $\{y_n\}_{n\ge 1}$ be a sequence of measurable linear
forms: $X \to \mathbb{R}$. Let $\|x\| := \sup_n |y_n(x)|$. Suppose $P(\|x\| < \infty) > 0$.
Then $\tau := (\sup_n \int y_n^2\,dP)^{1/2} < \infty$, and $E\exp(\alpha\|x\|^2) < \infty$ if and only if
$\alpha < 1/(2\tau^2)$.


Proof. For each $n$, $\|\cdot\| \ge |y_n|$. It is easily checked that $P \circ y_n^{-1}$ is
centered Gaussian and thus by Proposition 2.8 is a law $N(0, \sigma_n^2)$, $\sigma_n^2 \le \tau^2$.
Now $P(|y_n| \ge \sigma_n) \ge c > 0$ for all $n$ ($c > 0.3$). Thus if $\sigma_n$ are unbounded,
$P(\|\cdot\| \ge \sigma_n) \ge c$ for all $n$ gives a contradiction. Then to prove "only if," we
have $E\exp(|y_n|^2/(2\tau^2)) = \tau/(\tau^2 - \sigma_n^2)^{1/2}$, or $= +\infty$ if $\sigma_n^2 = \tau^2$. Taking the
supremum over $n$ gives $E\exp(\|x\|^2/(2\tau^2)) = +\infty$.

Now to prove "if," recall the space $\ell^\infty$ of all bounded sequences of real
numbers with supremum norm. This is a nonseparable Banach space. With
the smallest $\sigma$-algebra making the coordinates measurable, it is a measurable
vector space (by the way, this $\sigma$-algebra is smaller than the Borel $\sigma$-algebra
for the supremum norm). Let $Y(x) := \{y_n(x)\}_{n\ge 1}$. Let $S$ be the vector subspace
of $X$ where $\|\cdot\|$ is finite. Then $P(S) = 1$ by Theorem 2.9 and $Y$ is linear,
measurable, and preserves norms from $S$ into $\ell^\infty$. So it will be enough to prove
the theorem in $\ell^\infty$ with coordinates $\{y_n\}$.

For any finite $k$, for the projection $\pi_k : \{x_j\}_{j\ge 1} \mapsto \{x_j\}_{j=1}^k$, $P \circ \pi_k^{-1}$ is centered Gaussian
on $\mathbb{R}^k$ by Proposition 2.7, thus it is some normal law $N(0, \Sigma_k)$ by Proposition
2.8. By Gram-Schmidt orthonormalization in $L^2(P)$ (RAP, 5.4.6) we can write
$y_n = \sum_{j=1}^{m(n)} a_{nj}g_j$ for all $n$, where $g_j$ are linear functions on $\ell^\infty$ (finite linear
combinations of coordinates), are orthonormal in $L^2(P)$ and are normally
distributed, so they are i.i.d. $N(0, 1)$. If the $y_n$ are linearly independent in
$L^2(P)$, then $m(n) = n$. Otherwise, $m(n + 1) = m(n) + 1$ or $m(n)$ according as
$y_{n+1}$ is or is not linearly independent of the $y_j$ for $j \le n$.

Let $a_{nj} = 0$ for $j > m(n)$. Each $g_i$ is in turn a linear combination of
$y_1, \ldots, y_n$ for the least $n$ such that $m(n) \ge i$.

For each $n$, $\int y_n^2\,dP = \sum_j a_{nj}^2 \le \tau^2$. For $k = 1, 2, \ldots$, $n = 1, 2, \ldots$, let
$V_{kn} := \sum_{j>k} a_{nj}g_j$. Since $a_{nj} = 0$ for $j > n$, the sum defining $V_{kn}$ runs over
$k < j \le n$, and there is no problem of convergence. Let $\mathcal{B}_{-k}$ be the smallest
$\sigma$-algebra for which $g_j$ are measurable for all $j > k$. Then for any $0 \le j < k$
and $n$, we have $V_{kn} = E(V_{jn} \mid \mathcal{B}_{-k})$. Let $\|V_k\| := \sup_n |V_{kn}| \le +\infty$. Then for
$\alpha > 0$ we have the inequalities

$$\exp(\alpha\|V_k\|^2) = \exp\big(\alpha\sup_n |V_{kn}|^2\big) = \exp\big(\alpha\sup_n |E(V_{jn} \mid \mathcal{B}_{-k})|^2\big)$$
$$\le \exp\big(\alpha\{E(\sup_n |V_{jn}| \mid \mathcal{B}_{-k})\}^2\big) = \exp\big(\alpha\{E(\|V_j\| \mid \mathcal{B}_{-k})\}^2\big)$$
$$\le E\big(\exp(\alpha\|V_j\|^2) \mid \mathcal{B}_{-k}\big)$$


by conditional Jensen's inequality (RAP, 10.2.7) if the expectations are finite.
First, taking $j = 0$, $E\exp(\alpha\|V_0\|^2) < \infty$ for some $\alpha > 0$ by Lemma 2.10. Then
for $j = 0 < k$, the inequalities hold and give for that $\alpha$, $E\exp(\alpha\|V_k\|^2) \le
E\exp(\alpha\|V_0\|^2) < \infty$ for all $k$. So for almost all $y \in \ell^\infty$, $V_k(y) := \{V_{kn}(y)\}_{n\ge 1}
\in \ell^\infty$. Let $W_{-k} := \exp(\alpha\|V_k\|^2)$, $k = 0, 1, \ldots$. Then by the inequalities for
general $0 \le j < k$, $\{(W_j, \mathcal{B}_j) : j = \ldots, -2, -1, 0\}$ is a submartingale (RAP,
Section 10.3) and in view of its index set, a reversed submartingale. For any
$s > 0$ and finite $k$, by the Doob maximal inequality (RAP, 10.4.2),

$$\Pr\Big\{\max_{0\le j\le k}\|V_j\| > s\Big\} = \Pr\Big\{\max_{0\le j\le k} W_{-j} > \exp(\alpha s^2)\Big\} \le EW_0/\exp(\alpha s^2) < \infty$$


by choice of $\alpha$. Choose $s$ large enough so that $EW_0\exp(-\alpha s^2) < 1/2$. Then
letting $k \to \infty$,

$$\Pr\big(\limsup_{j\to\infty}\|V_j\| > s\big) \le \Pr\big(\sup_j \|V_j\| > s\big) \le 1/2.$$

Now $\limsup_j \|V_j\|$ is measurable for the "tail $\sigma$-algebra" $\bigcap_j \mathcal{B}_{-j}$. Thus
$\Pr(\limsup_j \|V_j\| > s) = 0$ by the Kolmogorov 0-1 law (RAP, 8.4.4).
Then for $0 < \varepsilon < 1/2$, there is a $k(\varepsilon) < \infty$ such that $\Pr(\|V_k\| > s) < \varepsilon$
for $k \ge k(\varepsilon)$ since if $\Pr(\|V_k\| > s) \ge \varepsilon$ for infinitely many values of $k$,
then $\Pr(\limsup_{k\to\infty}\|V_k\| > s) \ge \varepsilon$. Then by the last line of the proof of
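Lemma 2.10,

$$E\big(\exp(\beta\|V_k\|^2)\big) < \infty \quad \text{for } \beta < \frac{\log((1 - \varepsilon)/\varepsilon)}{24s^2}. \tag{2.7}$$

Now take any $\alpha$ with $0 < \alpha < 1/(2\tau^2)$. Choose $\gamma$ with $\alpha < \gamma < 1/(2\tau^2)$, then
$\varepsilon > 0$ small enough so that

$$\zeta := \frac{\alpha\gamma}{(\gamma^{1/2} - \alpha^{1/2})^2} < \frac{\log((1 - \varepsilon)/\varepsilon)}{24s^2}. \tag{2.8}$$

Let $k = k(\varepsilon)$ and $U_k(y) := y - V_k(y)$. Then

$$\alpha^{1/2}\|y\| \le \alpha^{1/2}\|U_k(y)\| + \alpha^{1/2}\|V_k(y)\| = (\alpha/\gamma)^{1/2}\gamma^{1/2}\|U_k(y)\| + (1 - (\alpha/\gamma)^{1/2})\,\zeta^{1/2}\|V_k(y)\|.$$

Since $t \mapsto \exp(t^2)$ is a convex function (RAP, Section 6.3, see Problem 1),

$$\exp(\alpha\|y\|^2) \le (\alpha/\gamma)^{1/2}\exp(\gamma\|U_k\|^2) + (1 - (\alpha/\gamma)^{1/2})\exp(\zeta\|V_k\|^2).$$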

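By (2.7) and (2.8), $E\exp(\zeta\|V_k\|^2) < \infty$. Now, by the Cauchy inequality, for
each $n$,

$$U_{kn}^2 = \Big(\sum_{i=1}^{k} a_{ni}g_i\Big)^2 \le \Big(\sum_{i=1}^{k} a_{ni}^2\Big)\Big(\sum_{j=1}^{k} g_j^2\Big),$$

so

$$E\big(\exp(\gamma\|U_k\|^2)\big) \le E\Big\{\exp\Big(\gamma\Big(\sum_{j=1}^{k} g_j^2\Big)\sup_n \sum_{i=1}^{k} a_{ni}^2\Big)\Big\} \le E\exp\Big\{\Big(\sum_{i=1}^{k} g_i^2\Big)\gamma\tau^2\Big\} = (1 - 2\gamma\tau^2)^{-k/2} < \infty.$$

So Theorem 2.11 is proved.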


Proof of Theorem 2.6. Let $\{x_n\}_{n=1}^{\infty}$ be dense in $X$. For each $n$, by the
Hahn-Banach theorem and a corollary (RAP, 6.1.5) there is a $y_n \in X'$ with $\|y_n\|' = 1$
and $|y_n(x_n)| = \|x_n\|$. Then for all $x \in X$, we have $\sup_n |y_n(x)| = \|x\|$, so the
Theorem follows from Theorem 2.11.


2.4 Inequalities and Comparisons for Gaussian Distributions 


The main result of this section will show that if a set of Gaussian random
variables is large enough in the sense defined in Section 1.2, meaning that
the number of variables more than $\varepsilon$ apart grows rather fast as $\varepsilon \downarrow 0$, then it is
almost surely unbounded. The main steps in the proof will be some inequalities,
one due to Slepian and another to Sudakov and Chevet.

In the next proof, the following well-known relations will be used. For any
random variable $Y \ge 0$ with distribution function $G$ we have

$$\int_0^\infty \Pr(Y > t)\,dt = \int_0^\infty (1 - G(t))\,dt = \int_0^\infty\!\!\int_t^\infty dG(y)\,dt \tag{2.9}$$
$$= \int_0^\infty\!\!\int_0^y dt\,dG(y) = \int_0^\infty y\,dG(y) = EY.$$

If $Y$ is any random variable with a finite expectation, let $Y^+ := \max(Y, 0) \ge 0$
and $Y^- := -\min(Y, 0) \ge 0$, so that $Y = Y^+ - Y^-$ and we get

$$EY = \int_0^{+\infty} \Pr(Y > t)\,dt - \int_0^{+\infty} \Pr(Y < -t)\,dt. \tag{2.10}$$
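
For a random variable with finitely many values, both tail integrals in (2.10) are integrals of step functions and can be computed exactly. The sketch below does this with rational arithmetic for an illustrative distribution; the function name and the data are hypothetical.

```python
from fractions import Fraction

def tail_integral(values, probs, sign):
    """Exact integral over t in (0, inf) of Pr(sign * Y > t) for finitely supported Y.

    The tail is a step function whose jumps are at the positive values of sign * Y,
    so the integral is a finite sum of (interval length) * (tail probability).
    """
    points = sorted({sign * v for v in values if sign * v > 0})
    total, prev = Fraction(0), Fraction(0)
    for b in points:
        # For t in (prev, b), sign * Y > t exactly when sign * Y >= b.
        tail = sum((p for v, p in zip(values, probs) if sign * v >= b), Fraction(0))
        total += (b - prev) * tail
        prev = b
    return total

values = [Fraction(v) for v in (-3, -1, 0, 2, 5)]
probs = [Fraction(n, 10) for n in (1, 2, 3, 3, 1)]

expectation = sum(v * p for v, p in zip(values, probs))
upper = tail_integral(values, probs, 1)    # integral of Pr(Y > t)
lower = tail_integral(values, probs, -1)   # integral of Pr(Y < -t) = Pr(-Y > t)
```

Here `upper` is $EY^+$ and `lower` is $EY^-$, so their difference reproduces $EY$ exactly.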

Theorem 2.12 (Slepian's inequality) Let $X_1, \ldots, X_n$ be real random
variables with a normal joint distribution $N(0, r)$ on $\mathbb{R}^n$. For any real $\lambda_1, \ldots, \lambda_n$, let
$P_n(r) := P_n(r, \{\lambda_j\}_{j=1}^n) := \Pr\{X_j \ge \lambda_j \text{ for all } j = 1, \ldots, n\}$. Let $q$ be another
covariance matrix, with $r_{ii} = q_{ii}$ for all $i = 1, \ldots, n$ and $r_{ij} \ge q_{ij}$ for all $i$ and
$j$. Then $P_n(r) \ge P_n(q)$.

If $(X_1, \ldots, X_n)$ have distribution $N(0, r)$ and $(Y_1, \ldots, Y_n)$ have distribution
$N(0, q)$ then $E\max(X_1, \ldots, X_n) \le E\max(Y_1, \ldots, Y_n)$.


Proof. If for some $i$, $r_{ii} = q_{ii} = 0$, then $P_n(r) = P_n(q) = 0$ if $\lambda_i > 0$, whereas
if $\lambda_i \le 0$, then $X_i \ge \lambda_i$ holds with probability 1 for both distributions, and we
can eliminate $X_i$ and reduce $n$ until $r_{ii} = q_{ii} > 0$ for all $i$.

Suppose next that $r$ is nonsingular, so that it is strictly positive definite and
$N(0, r)$ has a density $g_n$ given by Fourier inversion of its characteristic function
as

$$g_n(x_1, \ldots, x_n) = g_n(x_1, \ldots, x_n, r) = (2\pi)^{-n}\int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty}\exp\Big(-i\sum_{j=1}^{n} x_jt_j - \frac{1}{2}\sum_{k,m=1}^{n} r_{km}t_kt_m\Big)\,dt_1\cdots dt_n$$

(RAP, Theorem 9.5.4). Since $r$ is symmetric, it is given by the $n(n + 1)/2$
variables $r_{km}$, $1 \le k \le m \le n$. The partial derivatives $\partial^2 g_n/\partial x_k\partial x_m$ can be
evaluated by differentiating under the integral signs, applying Corollary A.7
(in Appendix A below) twice, thus multiplying the integrand by $-t_kt_m$ (RAP,
Theorem 9.4.4). The same integral results from taking $\partial g_n/\partial r_{km}$ for $k < m$,
where $\partial/\partial r_{km}$ can be taken under the integral sign by Proposition A.16 (since
$r$ is strictly positive definite). So

$$\partial^2 g_n/\partial x_k\partial x_m = \partial g_n/\partial r_{km}, \quad k \ne m. \tag{2.11}$$


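Now

$$P_n(r) = \int_{\lambda_1}^{\infty}\!\!\cdots\!\int_{\lambda_n}^{\infty} g_n(x_1, \ldots, x_n)\,dx_1\cdots dx_n \tag{2.12}$$

where $g_n(x) = (2\pi)^{-n/2}(\det r)^{-1/2}\exp(-(r^{-1}x, x)/2)$ (RAP, 9.5.8). For a
positive definite symmetric matrix $s$, by Proposition A.16, applied to $t = s_{ij}$ and
$w(x) = x_ix_j$ (or $x_i^2/2$ if $i = j$), the integral of $\exp(-(sx, x)/2)$ over any
measurable region in $\mathbb{R}^n$, specifically the orthant $\{\lambda_i \le x_i < \infty,\ i = 1, \ldots, n\}$,
can be differentiated under the integral sign with respect to any component of
$s$. Then, since the functions $r \mapsto r^{-1}$ and $r \mapsto (\det r)^{-1/2}$ are smooth for $r$ in
the set of symmetric, (strictly) positive definite matrices, the integral (2.12) can
be differentiated under the integral sign with respect to any $r_{km}$, $k < m$. Then
by (2.11), we need to evaluate $\int_{\lambda_1}^{\infty}\cdots\int_{\lambda_n}^{\infty} \partial^2 g_n/\partial x_k\partial x_m\,dx_1\cdots dx_n$. Since the
integrand is absolutely integrable, we can do the integrations in any order
and replace $\int_{\lambda_k}^{\infty}\int_{\lambda_m}^{\infty} dx_k\,dx_m$ by $\lim_{M\to+\infty}\int_{\lambda_k}^{M}\int_{\lambda_m}^{M} dx_k\,dx_m$. Here we may as
well assume that $k = 1$, $m = 2$. Now since $g_n$ is a smooth function, for
$M > \max(\lambda_1, \lambda_2)$,

$$\int_{\lambda_2}^{M}\!\!\int_{\lambda_1}^{M} \partial^2 g_n/\partial x_1\partial x_2\,dx_1\,dx_2 = g_n(M, M, x_3, \ldots, x_n) - g_n(M, \lambda_2, x_3, \ldots, x_n)$$
$$\qquad - g_n(\lambda_1, M, x_3, \ldots, x_n) + g_n(\lambda_1, \lambda_2, x_3, \ldots, x_n)$$
$$\to g_n(\lambda_1, \lambda_2, x_3, \ldots, x_n) \quad \text{as } M \to \infty.$$

Thus

$$\partial P_n(r)/\partial r_{12} = \int_{\lambda_3}^{\infty}\!\!\cdots\!\int_{\lambda_n}^{\infty} g_n(\lambda_1, \lambda_2, x_3, \ldots, x_n)\,dx_3\cdots dx_n \ge 0.$$

Likewise, $\partial P_n(r)/\partial r_{km} \ge 0$ for all $k < m$.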


In the general case, let $\varepsilon > 0$ and $0 \le \lambda \le 1$. Let $I$ be the identity matrix
and $p := \lambda r + (1 - \lambda)q + \varepsilon I$, which is positive definite. Then

$$\frac{dP_n(p)}{d\lambda} = \sum_{k<m}\frac{\partial P_n(p)}{\partial p_{km}}\,\frac{dp_{km}}{d\lambda} = \sum_{k<m}\frac{\partial P_n(p)}{\partial p_{km}}\,(r_{km} - q_{km}) \ge 0,$$

where no diagonal terms appear because $r_{kk} = q_{kk}$ for all $k$.
Integrating from 0 to 1 with respect to $\lambda$ gives $P_n(r + \varepsilon I) \ge P_n(q + \varepsilon I)$. The
laws $N(0, r + \varepsilon I)$ converge to $N(0, r)$ as $\varepsilon \downarrow 0$. (To prove this, let $X$ and $Y$ be
independent with $\mathcal{L}(X) = N(0, r)$ and $\mathcal{L}(Y) = N(0, I)$. Then $\mathcal{L}(X + \sqrt{\varepsilon}\,Y) =
N(0, r + \varepsilon I)$ and as $\varepsilon \downarrow 0$, $X + \sqrt{\varepsilon}\,Y \to X$ a.s., hence in law.) Likewise,
$N(0, q + \varepsilon I)$ converges to $N(0, q)$. Since $r_{kk} = q_{kk} > 0$ for all $k$, for $P =
N(0, r)$ or $N(0, q)$, $P(X_k = \lambda_k) = 0$, and the boundary of the orthant $\{X_k \ge \lambda_k$
for $k = 1, \ldots, n\}$ has zero probability. Then, using the portmanteau theorem
(RAP, 11.1.1(d)) it follows that $P_n(r) \ge P_n(q)$.

Taking all $\lambda_j = -\lambda$ and applying the conclusion to $(-X_1, \ldots, -X_n)$, which
does not change the distribution, we get

$$N(0, r)\Big(\max_{1\le j\le n} X_j \le \lambda\Big) \ge N(0, q)\Big(\max_{1\le j\le n} X_j \le \lambda\Big).$$

Then applying (2.10) gives the last conclusion.
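
For $n = 2$ and unit variances, both conclusions of Slepian's inequality can be read off from classical closed forms, which are standard facts not stated in the text: for standard normal $X_1, X_2$ with correlation $\rho$, $\Pr\{X_1 \ge 0, X_2 \ge 0\} = 1/4 + \arcsin(\rho)/(2\pi)$ and $E\max(X_1, X_2) = ((1 - \rho)/\pi)^{1/2}$. The sketch below only illustrates the monotonicity in $\rho$ for thresholds $\lambda_1 = \lambda_2 = 0$; it is not a substitute for the proof.

```python
import math

def orthant_prob(rho):
    """Pr{X1 >= 0, X2 >= 0} for standard bivariate normal with correlation rho."""
    return 0.25 + math.asin(rho) / (2.0 * math.pi)

def expected_max(rho):
    """E max(X1, X2) for standard bivariate normal with correlation rho."""
    return math.sqrt((1.0 - rho) / math.pi)

# Increasing correlations: r_12 >= q_12 corresponds to moving right in this list.
rhos = [-0.9, -0.5, 0.0, 0.3, 0.8, 1.0]
orthants = [orthant_prob(r) for r in rhos]
maxima = [expected_max(r) for r in rhos]
```

Raising the correlation raises the orthant probability and lowers the expected maximum, in agreement with the two statements of Theorem 2.12.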


Example Let $r_{ij} = 1$, $i, j = 1, 2, 3$, and let $q$ be the $3 \times 3$ identity matrix.
Let $s = r$ except that $s_{13} = s_{31} = 0$. Then $s$ is not nonnegative definite, in other
words it is not a covariance matrix, because $s_{ij} = 1$ for $\{i, j\} \ne \{1, 3\}$ implies
$X_1 = X_2 = X_3$ a.s. with a $N(0, 1)$ distribution while $s_{13} = s_{31} = 0$ implies $X_1$
and $X_3$ are independent.

Thus, in the above proof, if we change $r_{km}$ to $q_{km}$ for one pair of indices $k, m$
at a time, it is possible to pass through values of the matrix which are not positive
definite. Then, the "normal density" $g_n(x_1, \ldots, x_n, r)$ is not well-defined, and
integrals of its "characteristic function" diverge.
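
The failure of nonnegative definiteness in this example can also be seen from a single test vector: a quadratic form $(sv, v)$ that is negative rules out $s$ as a covariance matrix. The vector $v = (1, -1, 1)$ below is an illustrative choice, not taken from the text.

```python
def quad_form(A, v):
    """The quadratic form (A v, v) for a square matrix A given as nested lists."""
    n = len(v)
    return sum(A[i][j] * v[i] * v[j] for i in range(n) for j in range(n))

r = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]   # all covariances equal to 1
s = [row[:] for row in r]
s[0][2] = s[2][0] = 0                   # change only the (1, 3) entry toward q

v = [1, -1, 1]
value_r = quad_form(r, v)   # (1 - 1 + 1)^2, a genuine variance
value_s = quad_form(s, v)   # would be Var(X1 - X2 + X3) if s were a covariance
```

A negative value for `value_s` would have to be a variance, which is impossible, so the one-entry-at-a-time path from $r$ to $q$ leaves the set of covariance matrices.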


Recall that the correlation (coefficient) $r(X, Y)$ of two nonconstant variables
$X$ and $Y$ with finite second moments is defined by

$$r(X, Y) := E((X - EX)(Y - EY))/(\sigma_X\sigma_Y)$$

where $\sigma_X := \sigma(X)$ is the standard deviation $(E(X - EX)^2)^{1/2}$.


Corollary 2.13 Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ be two sets of jointly normally
distributed variables with mean 0, $\sigma(X_i) > 0$ and $\sigma(Y_i) > 0$ for all $i$, and
$r(X_i, X_j) \ge r(Y_i, Y_j)$ for all $i \ne j = 1, \ldots, n$. Then

$$\Pr\{X_i \ge 0,\ i = 1, \ldots, n\} \ge \Pr\{Y_i \ge 0,\ i = 1, \ldots, n\}.$$


Proof. Replacing each $X_i$ by $X_i/\sigma(X_i)$ and $Y_i$ by $Y_i/\sigma(Y_i)$ does not change
the events being considered or the correlations, and gives covariances to which
Slepian's inequality applies.


Let $C$ be a jointly Gaussian set of random variables with mean 0,
and with the $L^2$ metric $d(X, Y) := (E(X - Y)^2)^{1/2}$. Recall that for $\varepsilon > 0$,
$D(\varepsilon, C) := D(\varepsilon, C, d) := \sup\{n : \text{for some } X_1, \ldots, X_n \in C,\ d(X_i, X_j) >
\varepsilon,\ 1 \le i < j \le n\}$. In the following, by Theorem 1.9, $D(\varepsilon, C)$ could be replaced
equivalently by $N(\varepsilon, C, d)$. An inequality called the Sudakov minoration
(Theorem 2.22 below) is closely related. V. N. Sudakov around 1965 first discovered
a fact close to the following. S. Chevet in 1970 first published a proof.


Theorem 2.14 (Sudakov-Chevet) If $\limsup_{\varepsilon\downarrow 0} \varepsilon^2\log D(\varepsilon, C) = +\infty$, then
$\sup\{|X| : X \in C\} = +\infty$ almost surely.


Proof. Let $\Phi$ be the standard normal distribution function and $G(x) := 1 -
\Phi(x)$. Most of the proof is in the next inequality.


Lemma 2.15 (Chevet) Let $X_1, \ldots, X_n$ be jointly normally distributed with
mean 0 and such that for some $M < \infty$ and some $\varepsilon$ with $0 < \varepsilon \le 1$, we have
$d(0, X_j) \le M$ for all $j$ and $d(X_i, X_j) \ge \varepsilon$ for all $i \ne j$. Let $K := 2^{1/2}(M^2 + 1)$.
Then

$$G(1)\Pr\{X_j \le 1,\ j = 1, \ldots, n\} \le 2^{-1-n} + (2\pi)^{-1/2}\int_0^\infty \exp(-t^2/2)\,\Phi(Kt/\varepsilon)^n\,dt.$$
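Proof. Let $e_1, \ldots, e_n$ be orthonormal variables such that $X_1, \ldots, X_n$ are in their
linear span, by the Gram-Schmidt process, RAP 5.4.6. Then $e_1, \ldots, e_n$ are i.i.d.
$N(0, 1)$. Let $e_{n+1}$ be another $N(0, 1)$ variable independent of $e_1, \ldots, e_n$. Let
$Y_i := X_i - e_{n+1}$, $i = 1, \ldots, n$. Then

$$G(1)\Pr\{X_j \le 1,\ j = 1, \ldots, n\} = \Pr\{e_{n+1} \ge 1 \text{ and } X_j \le 1,\ 1 \le j \le n\} \le \Pr\{Y_i \le 0,\ i = 1, \ldots, n\}.$$

Let $b_{ij} := r(Y_i, Y_j)$ and $\|Y\| := d(0, Y)$. Let $\theta_{ij}$ be the angle between $Y_i$ and
$Y_j$ at 0, so that $b_{ij} = \cos(\theta_{ij})$. Let $U := X_i$ and $V := X_j$ with $i \ne j$. Then
$u := \|U\| \le M$, $v := \|V\| \le M$, and $\|U - V\| \ge \varepsilon$. So

$$\varepsilon^2 \le \|U - V\|^2 = u^2 - 2(U, V) + v^2,$$

and $2(U, V) \le u^2 + v^2 - \varepsilon^2$. Thus

$$((U, V) + 1)^2 = (U, V)^2 + 2(U, V) + 1 \le u^2v^2 + u^2 + v^2 + 1 - \varepsilon^2,$$

so

$$b_{ij}^2 = \frac{((U, V) + 1)^2}{(u^2 + 1)(v^2 + 1)} \le 1 - \frac{\varepsilon^2}{(u^2 + 1)(v^2 + 1)} \le 1 - \frac{\varepsilon^2}{(M^2 + 1)^2} = 1 - 2\varepsilon^2/K^2 \le 1/(1 + \varepsilon^2/K^2)^2.$$

Let $f_i := \varepsilon e_i/K - e_{n+1}$, $f_{ij} := r(f_i, f_j) = 1/(1 + \varepsilon^2/K^2)$ for $i \ne j$, $b_{ii} =
f_{ii} := 1$. Then Slepian's inequality (above), specifically Corollary 2.13, gives

$$\Pr\{Y_i \le 0,\ i = 1, \ldots, n\} \le \Pr\{f_i \le 0,\ i = 1, \ldots, n\} = \Pr\{\varepsilon e_i/K \le e_{n+1},\ i = 1, \ldots, n\}$$
$$= (2\pi)^{-1/2}\int_{-\infty}^{\infty} \Pr\{e_i \le tK/\varepsilon,\ i = 1, \ldots, n\}\exp(-t^2/2)\,dt$$
$$= (2\pi)^{-1/2}\int_{-\infty}^{\infty} \Phi(tK/\varepsilon)^n\exp(-t^2/2)\,dt$$
$$= (2\pi)^{-1/2}\int_0^{\infty} \exp(-t^2/2)\,[\Phi^n(Kt/\varepsilon) + \Phi^n(-Kt/\varepsilon)]\,dt$$
$$\le 2^{-1-n} + (2\pi)^{-1/2}\int_0^{\infty} \exp(-t^2/2)\,\Phi^n(Kt/\varepsilon)\,dt,$$

proving the Lemma.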


Now to prove Theorem 2.14, take $\varepsilon_k \downarrow 0$ such that $\varepsilon_k^2\log D(\varepsilon_k, C) > k$, $k =
1, 2, \ldots$. In Lemma 2.15 let $\varepsilon = \varepsilon_k$, $n = D(\varepsilon_k, C)$. If the variances of the
variables in $C$ are unbounded, let $X(n) \in C$ with $\sigma(X(n)) \ge n$. Then clearly
$|X(n)|$ are unbounded a.s. So we can assume that for some $M < \infty$, $\sigma(X) \le M$
for all $X \in C$.

For $1 \le B < \infty$, the probability that $|X_j| \le B$ for $j = 1, \ldots, n$ is the
probability that $|X_j/B| \le 1$ for each $j$. To apply Lemma 2.15, we can then replace
$\varepsilon_k$ by $\varepsilon_k/B$, giving the same bound except for replacing $Kt/\varepsilon_k$ by $s/\varepsilon_k$ where
$s := KtB$.

Since $\Phi^n = (1 - G)^n \le e^{-nG}$, it will be enough using the dominated
convergence theorem to show that $D(\varepsilon_k, C)\,G(s/\varepsilon_k) \to +\infty$ as $k \to \infty$ for every
$s > 0$. As $x \to +\infty$, $G(x) \sim (2\pi)^{-1/2}x^{-1}\exp(-x^2/2)$ (RAP, Lemma 12.1.6),
so for $k$ large, $G(s/\varepsilon_k) \ge \varepsilon_k(3s)^{-1}\exp(-(s/\varepsilon_k)^2/2)$. So it is now enough to
prove that $\log D(\varepsilon_k, C) + \log\varepsilon_k - c\varepsilon_k^{-2} \to +\infty$ as $k \to \infty$ for any $c < \infty$.
Now $\log\varepsilon \ge -\varepsilon^{-2}$ for small $\varepsilon > 0$, so the $\log\varepsilon_k$ term can be removed. By
assumption, $\varepsilon_k^2\log D(\varepsilon_k, C) - c \to +\infty$, and Theorem 2.14 follows on
multiplying by $\varepsilon_k^{-2} \to +\infty$ also.


Example Let $G_n$ be i.i.d. $N(0, 1)$ variables and $X_n := G_n/(\log n)^{1/2}$, $n \ge
2$. Then $d(X_j, X_k) > (\log j)^{-1/2}$ for $j < k$, so $D(\varepsilon, \{X_j\}_{2\le j\le n}) \ge n - 1$
if $\varepsilon \le (\log(n - 1))^{-1/2}$. So for $0 < \varepsilon \le 1$, $D(\varepsilon, \{X_j\}_{j\ge 2}) \ge [\exp(\varepsilon^{-2})] \ge
\exp(\varepsilon^{-2})/2$ where $[x]$ denotes the greatest integer $\le x$. So $\log D(\varepsilon, \{X_j\}_{j\ge 2}) \ge
\varepsilon^{-2} - \log 2 \ge \varepsilon^{-2}/2$ for $0 < \varepsilon \le 1/2$. On the other hand for each $n$,
Proposition 2.5 gives $\Pr\{|X_n| \ge 2\} \le \exp(-2\log n) = 1/n^2$, and so by the
Borel-Cantelli Lemma, $\limsup_{n\to\infty}|X_n| \le 2$ and $\sup_n |X_n| < +\infty$ a.s., so Theorem
2.14 is sharp.
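
The two halves of this example can be checked numerically. The sketch below, with hypothetical helper names, verifies the tail bound $1/n^2$ against the exact normal tail and confirms by brute force that $[\exp(\varepsilon^{-2})]$ of the variables are more than $\varepsilon$ apart for a sample value of $\varepsilon$, using $d(X_j, X_k)^2 = 1/\log j + 1/\log k$ for independent coordinates.

```python
import math

def tail_exact(n):
    """Pr(|X_n| >= 2) for X_n = G_n / sqrt(log n), G_n ~ N(0, 1)."""
    return math.erfc(math.sqrt(2.0 * math.log(n)))

def dist(j, k):
    """L^2 distance between X_j and X_k for j != k (independent, mean 0)."""
    return math.sqrt(1.0 / math.log(j) + 1.0 / math.log(k))

def separated_count(eps):
    """Return m = floor(exp(eps^-2)) if X_2, ..., X_{m+1} are pairwise > eps apart."""
    m = math.floor(math.exp(eps ** -2))
    pts = range(2, m + 2)
    ok = all(dist(j, k) > eps for j in pts for k in pts if j < k)
    return m if ok else None

eps = 0.5
m = separated_count(eps)
```

So $\varepsilon^2\log D(\varepsilon, C)$ stays bounded away from 0 while the variables remain almost surely bounded, which is the sense in which the hypothesis of Theorem 2.14 cannot be weakened to a finite positive lim sup.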


Here is a further fact related to Slepian's inequality that avoids the rather
restrictive assumption $r_{ii} = q_{ii}$:


Theorem 2.16 Let $N(0, C)$ and $N(0, D)$ be two normal measures with mean
0 on $\mathbb{R}^n$. Let $X = \{X_j\}_{j=1}^n$ have law $N(0, C)$, and let $Y = \{Y_j\}_{j=1}^n$ have
law $N(0, D)$. Suppose that for all $i, j = 1, \ldots, n$, we have $E((Y_i - Y_j)^2) \le
E((X_i - X_j)^2)$, in other words for each $i, j$,

$$D_{ii} + D_{jj} - 2D_{ij} \le C_{ii} + C_{jj} - 2C_{ij}. \tag{2.13}$$

Then

(a) $E\{\max_{1\le i,j\le n}(Y_i - Y_j)\} \le E\{\max_{1\le i,j\le n}(X_i - X_j)\}$ and
(b) $E\max_j Y_j \le E\max_j X_j$.
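
For $n = 2$ the theorem reduces to an explicit formula, which makes the role of (2.13) transparent. Since $\max(X_1, X_2) = (X_1 + X_2)/2 + |X_1 - X_2|/2$ and $E|Z| = d(2/\pi)^{1/2}$ for $Z \sim N(0, d^2)$, one gets $E\max(X_1, X_2) = d/(2\pi)^{1/2}$ with $d^2 = C_{11} + C_{22} - 2C_{12}$. The sketch below evaluates this for two illustrative covariance matrices; the matrices and names are assumptions chosen only to satisfy (2.13).

```python
import math

def increment_var(C):
    """E (X_1 - X_2)^2 = C_11 + C_22 - 2 C_12 for a 2 x 2 covariance matrix C."""
    return C[0][0] + C[1][1] - 2.0 * C[0][1]

def expected_max_pair(C):
    """E max(X_1, X_2) for (X_1, X_2) ~ N(0, C): d / sqrt(2 pi)."""
    return math.sqrt(increment_var(C) / (2.0 * math.pi))

def expected_max_diff_pair(C):
    """E max_{i,j} (X_i - X_j) = E |X_1 - X_2| = 2 E max(X_1, X_2)."""
    return 2.0 * expected_max_pair(C)

C = [[1.0, 0.2], [0.2, 2.0]]   # law of X, increment variance 2.6
D = [[3.0, 2.0], [2.0, 3.0]]   # law of Y, increment variance 2.0, larger variances
```

Here $Y$ has larger variances than $X$ but smaller increments, and it is the increments that control the expected maximum, which is why no condition like $r_{ii} = q_{ii}$ is needed.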


Proof. First, it will be shown that we can assume $C$ and $D$ are nonsingular.
Suppose the Theorem holds in that case, and let $C$, $D$ be possibly singular.
Then for any $t > 0$, $C + t^2I$ and $D + t^2I$ are nonsingular and the hypotheses
hold for them, thus the conclusion. Now, $N(0, C + t^2I)$ is the law of $X + tZ$
where $Z$ has law $N(0, I)$ and is independent of $X$, and likewise for $D$ and $Y$.
Then, letting $t \downarrow 0$, the Theorem follows for $C$, $D$.

Suppose then that $C$ and $D$ are nonsingular. For $0 \le \lambda \le 1$ let $C_\lambda :=
\lambda C + (1 - \lambda)D$ and let $g(\lambda) := E\max_{1\le i,j\le n}(U_i - U_j)$ where $U := \{U_i\}_{i=1}^n$
has law $N(0, C_\lambda)$. It will be enough to show that $dg(\lambda)/d\lambda \ge 0$ for $0 \le \lambda \le 1$,
taking a right derivative at 0 and a left derivative at 1. We have $g(\lambda) =
\int \max_{1\le i,j\le n}(x_i - x_j)\,f_\lambda(x)\,dx$ where $f_\lambda$ is the density of $N(0, C_\lambda)$, namely,

$$f_\lambda(x) = (2\pi)^{-n/2}(\det C_\lambda)^{-1/2}\exp(-(C_\lambda^{-1}x, x)/2).$$


Since $C$ and $D$ are nonsingular, they are strictly positive definite. Thus
for some $\gamma > 0$, $(C_\lambda y, y) \ge \gamma|y|^2$ for all $y$ and $0 \le \lambda \le 1$. Thus $\det C_\lambda$ is
bounded below and $(\det C_\lambda)^{-1/2}$ is bounded above for $0 \le \lambda \le 1$. Here $\gamma$
can also be chosen so that $(C_\lambda^{-1}x, x) \ge \gamma|x|^2$ for all $x \in \mathbb{R}^n$ and $0 \le \lambda \le 1$.
The mappings $\lambda \mapsto C_\lambda$ and $\lambda \mapsto C_\lambda^{-1}$ are both smooth ($C^\infty$), and the map
$(C, x) \mapsto (\det C)^{-1/2}\exp(-(Cx, x))$ is a smooth map on the open set of strictly
positive definite symmetric matrices $C$ and all $x \in \mathbb{R}^n$. It follows that the
functions $f_\lambda(x)$ and their difference-quotients and partial derivatives with respect
to $\lambda$ and $x$ are uniformly integrable on $\mathbb{R}^n$ with respect to $x$ and remain so if
multiplied by any function bounded above in absolute value by a polynomial.
Thus by Theorem A.2 of Appendix A, we can differentiate under the integral
sign:

$$\frac{dg(\lambda)}{d\lambda} = \int \max_{1\le r,s\le n}(x_r - x_s)\,(\partial f_\lambda(x)/\partial\lambda)\,dx. \tag{2.14}$$

As in (2.11) we have $\partial^2 f_\lambda/\partial x_k^2 = 2\,\partial f_\lambda/\partial (C_\lambda)_{kk}$ for $k = 1, \ldots, n$. By this and
(2.11) itself we have

$$\frac{\partial f_\lambda(x)}{\partial\lambda} = \frac{1}{2}\sum_{i,j=1}^{n}\frac{d(C_\lambda)_{ij}}{d\lambda}\,\frac{\partial^2 f_\lambda}{\partial x_i\partial x_j} \tag{2.15}$$


where $d(C_\lambda)_{ij}/d\lambda = C_{ij} - D_{ij}$. Let $\mathcal{S} := \mathcal{S}(\mathbb{R}^n)$ be the space of all $C^\infty$
functions $f$ from $\mathbb{R}^n$ into $\mathbb{R}$ such that for every polynomial $P(\cdot)$ and every
$n$-tuple $p := (p_1, \ldots, p_n)$ of nonnegative integers with $|p| := p_1 + \cdots + p_n$,
$\sup_x |P(x)D^pf(x)| < \infty$ where $D^pf(x) := \partial^{|p|}f(x)/\partial x_1^{p_1}\cdots\partial x_n^{p_n}$. After
multiplying $P(x)$ by some power of $1 + |x|^2$ where $|x|^2 := x_1^2 + \cdots + x_n^2$,
we see that each $P(x)D^pf(x)$ is integrable on $\mathbb{R}^n$. Here $\mathcal{S}(\mathbb{R}^n)$ is Laurent
Schwartz's space of rapidly decreasing functions. It is easily seen that any
normal density such as $f_\lambda$ is in $\mathcal{S}(\mathbb{R}^n)$.


Let $u(x) := \max_j x_j$ and for each $i = 1, \ldots, n$ let $v_i := v_i(x) :=
\max_{j\ne i} x_j$. Let $dx/dx_i := \prod_{j\ne i} dx_j$. For any $i \ne j$ define a function $g_{ij}$
by $g_{ij}(x) = \{y_r\}_{r=1}^n$ where $y_r := x_r$ for $r \ne i$ and $y_i := x_j$. Let
$V_{ij}$ be the measure defined for suitable functions $\phi$ on $\mathbb{R}^n$ by $\int \phi\,dV_{ij} :=
\int_{A(j)} \phi(g_{ij}(x))\,dx/dx_i$ where $A(j)$ is the set of $\{x_r\}_{r\ne i}$ such that $x_r \le x_j$
for all $r \ne i$, or equivalently $v_i = x_j$. In other words, $V_{ij}$ is a measure on the
subset where $x_r \le x_j$ for all $r$ of the $(n - 1)$-dimensional hyperplane $\{x_i = x_j\}$,
given by $dx/dx_i$ or equivalently by $dx/dx_j$. On $A(j)$, as an image by a linear
map $g_{ij}$ of Lebesgue measure on $\mathbb{R}^{n-1}$, $V_{ij}$ is a multiple (by $2^{-1/2}$) of Lebesgue
measure on the hyperplane $\{x_i = x_j\}$. Thus, $V_{ij} = V_{ji}$ for $1 \le i < j \le n$.


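Lemma 2.17 For any $\phi \in \mathcal{S}(\mathbb{R}^n)$,

(a) For any $i = 1, \ldots, n$, $\int u(x)(\partial\phi/\partial x_i)\,dx = -\int_{\{x_i \ge v_i\}} \phi(x)\,dx$.
(b) For any $i \ne j$ with $1 \le i, j \le n$,

$$\int u(x)(\partial^2\phi/\partial x_i\partial x_j)\,dx = -\int \phi\,dV_{ij}. \tag{2.16}$$

(c) For each $i = 1, \ldots, n$,

$$\int u(x)(\partial^2\phi/\partial x_i^2)\,dx = \int \phi(x)|_{x_i=v_i}\,dx/dx_i. \tag{2.17}$$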


Proof. For (a), we have $\int u(x)(\partial\phi/\partial x_i)\,dx = \int (u - v_i)(x)(\partial\phi/\partial x_i)\,dx$ since
$v_i$ does not depend on $x_i$. Integrating with respect to $x_i$ first we get
$$\int\!\!\int_{v_i}^{\infty} (x_i - v_i)\,\frac{\partial\phi}{\partial x_i}\,dx_i\,\frac{dx}{dx_i}.$$
Integrating by parts in the inner integral, the boundary terms are 0 both at
$x_i = v_i$ and at $x_i = \infty$ since $\phi \in \mathcal{S}$, so (a) follows.

Then to prove (b), apply (a) to $\partial\phi/\partial x_j$. Integrating first with respect to $x_j$,
(b) follows.

For (c), applying (a) to $\partial\phi/\partial x_i$ and integrating first with respect to $x_i$ gives
(c).


Now continuing the proof of Theorem 2.16, note that we have $\max_{i,j}(X_i - X_j) = \max_i X_i - \min_j X_j$
and $\min_j X_j = -\max_j(-X_j)$. Normal laws with
mean 0 are symmetric, so $\{X_i\}_{i=1}^n$ and $\{-X_j\}_{j=1}^n$ have the same law. Thus,
$$E\max_{i,j}(X_i - X_j) = 2E\max_i X_i. \tag{2.18}$$
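Then, from equations (2.14), (2.15), (2.18), (2.16), and (2.17), we get
$$\frac{dg}{d\lambda} = 2\sum_{1\le i<j\le n}(D_{ij} - C_{ij})\int f_\lambda\,dV_{ij} + \sum_{i=1}^{n}(C_{ii} - D_{ii})\int f_\lambda(x)\big|_{x_i = v_i}\,\frac{dx}{dx_i}.$$

Now, for Lebesgue measure $dx/dx_i$ on $\{x_j\}_{j\ne i}$, each set $\{x_j = x_r\}$ for $j \ne i \ne r \ne j$
has measure 0, and thus $\sum_{j\ne i} 1_{v_i = x_j} = 1$ almost everywhere. So
$$\int f_\lambda(x)\big|_{x_i = v_i}\,\frac{dx}{dx_i} = \sum_{j\ne i}\int f_\lambda\,dV_{ij},$$
and
$$\frac{dg}{d\lambda} = \sum_{i=1}^{n}\sum_{j\ne i}(D_{ij} - C_{ij} + C_{ii} - D_{ii})\int f_\lambda\,dV_{ij}.$$

Symmetrizing the sum, interchanging $j$ and $i$, gives
$$\frac{dg}{d\lambda} = \frac{1}{2}\sum_{i=1}^{n}\sum_{j\ne i}(C_{ii} + C_{jj} + 2D_{ij} - 2C_{ij} - D_{ii} - D_{jj})\int f_\lambda\,dV_{ij} \ge 0$$
by (2.13). Thus (a) of Theorem 2.16 is proved. Then since (2.18) also holds for
$Y$, (b) follows and Theorem 2.16 is proved.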


Here is another fact related to Slepian's inequality.


Theorem 2.18 Let $T$ be a set and $\{X_t\}_{t\in T}$ and $\{Y_t\}_{t\in T}$ two Gaussian processes
with mean 0. Let $d_X(s,t) := (E((X_s - X_t)^2))^{1/2}$ and likewise define $d_Y$.
Suppose that $d_X(s,t) \le d_Y(s,t)$ for all $s, t \in T$, and that $T$ has a countable
subset $S$ dense for $d_Y$, thus also for $d_X$. Suppose $E\sup_{t\in S} Y_t < \infty$. Then
$\operatorname{ess.sup}_{t\in T} Y_t := \{Y_t\}^*_{t\in T} = \sup_{t\in U} Y_t$ is defined up to almost sure equality for
all countable $d_Y$-dense subsets $U$ of $T$, and
$$E\{X_t\}^*_{t\in T} \le E\{Y_t\}^*_{t\in T} < \infty. \tag{2.19}$$
We also have
$$E(\operatorname{ess.sup}_{s,t\in T}\, X_s - X_t) \le E(\operatorname{ess.sup}_{s,t\in T}\, Y_s - Y_t). \tag{2.20}$$

Also, let $\{|X_t|\}^*_{t\in T} := \operatorname{ess.sup}_{t\in T} |X_t|$ and likewise for $Y_t$. If $\inf_{t\in S} |X_t| = 0$
almost surely, then
$$E\{|X_t|\}^*_{t\in T} \le 2E\{|Y_t|\}^*_{t\in T} < \infty. \tag{2.21}$$


Proof. For (2.19) one can apply Theorem 2.16(b) for finite sets increasing up
to $S$ (or $U$) and use the proof of Lemma 2.4.

For (2.20), since $\{-X_t\}$ is equal in distribution to $\{X_t\}$, we have $E\operatorname{ess.sup}_{s\in T} X_s = E\operatorname{ess.sup}_{t\in T} (-X_t)$
and the left side of (2.20) equals $2E\operatorname{ess.sup}_{s\in T} X_s$.
Doing the same with $Y_t$, we get that (2.20) follows from (2.19).

To prove (2.21), for each $\omega$ and each $s \in T$, we have $X_s \le \sup_{t\in S}(X_s - X_t)$,
taking $t \in S$ with $X_t(\omega) \to 0$. Likewise, for each $t \in T$ we have $-X_t \le \sup_{s\in S}(X_s - X_t)$, taking
$s \in S$ such that $X_s(\omega) \to 0$. It follows that $\operatorname{ess.sup}_{t\in T} |X_t| \le \operatorname{ess.sup}_{s,t\in T}\,(X_s - X_t)$.
Since $Y_s - Y_t \le |Y_s| + |Y_t|$, we get (2.21).


2.5 Sample Boundedness 


First is a nice characterization of the GB property, due to Sudakov (1971, 
1973). 


Theorem 2.19 Let $C$ be a subset of a Hilbert space $H$. Then $C$ is a GB-set if
and only if $EL(C)^* < +\infty$.


Proof. If $C$ is nonseparable then by Theorem 2.14, it has a countable subset
$B$ which is not a GB-set, so $L(C)^* \ge L(B)^* = +\infty$ almost surely, and the
equivalence holds. So suppose $C$ is separable. Let $B$ be a countable dense subset
of $C$. Then $L(B)^* = L(C)^*$ a.s. by the proof of Lemma 2.4. Let $B = \{x_j\}_{j\ge 1}$.
If $E\sup_j L(x_j) < +\infty$, then $\sup_j L(x_j) < +\infty$ almost surely, and since $-L$
is a version of $L$, also $\inf_j L(x_j) > -\infty$ almost surely, so $B$ and hence $C$ is a
GB-set. Conversely suppose $B$ is a GB-set. Define a probability measure $P_B$
on the measurable linear space $\ell^\infty$ of all bounded sequences $\{y_j\}_{j\ge 1}$ of real
numbers with the smallest $\sigma$-algebra making all the coordinates measurable,
where $\{y_j\}_{j\ge 1}$ have the joint distribution of $\{L(x_j)\}_{j\ge 1}$. By Theorem 2.11,
$E\sup_j |L(x_j)| < +\infty$.


One can make a stronger statement. Let $g$ be a convex, increasing function
from $[0,\infty)$ onto itself. If $Y$ is a random variable such that $Eg(\delta|Y|) < \infty$
for some $\delta > 0$, let $\|Y\|_g := \inf\{c > 0: Eg(|Y|/c) \le 1\}$. Then $\|\cdot\|_g$ is a
seminorm on such random variables (Appendix H). If there is no such $\delta > 0$,
let $\|Y\|_g := +\infty$.


Theorem 2.20 There is an absolute constant $M < \infty$ such that for any subset
$C$ of a Hilbert space $H$, and $g(x) := \exp(x^2) - 1$,
$$\|L(C)^*\|_g \le \|\,|L(C)|^*\|_g \le M\,E(|L(C)|^*).$$


Proof. If $C$ is not a GB-set, all three expressions will be infinite, so suppose $C$
is a GB-set, which we can then take to be countable by Theorem 2.14. Then by
Theorem 2.11 as in the previous proof, $\|\,|L(C)|^*\|_g < \infty$.

The first inequality in the theorem is clear. For the second, suppose there is no
such $M < \infty$. Then there are countable GB-sets $C_j \subset H$ with $\|\,|L(C_j)|^*\|_g \ge j^3 E|L(C_j)|^*$
for $j = 1, 2, \dots$. By homogeneity, we can assume $E|L(C_j)|^* = 1$
for each $j$. Let $H_1, H_2, \dots$ be infinite-dimensional Hilbert spaces and form the
direct sum $H := \bigoplus_j H_j$, so that $H_j$ are taken as orthogonal subspaces of $H$.
We can take $C_j \subset H_j$ for each $j$. Let $D_j := C_j/j^2$ for each $j$. Let $D := \bigcup_j D_j \subset H$. Then
$$E|L(D)|^* = E\max_j |L(D_j)|^* \le \sum_j E|L(D_j)|^* = \sum_j j^{-2} < \infty,$$
so $D$ is a GB-set. Thus by Theorem 2.14 again, $\|\,|L(D)|^*\|_g < \infty$. But for
each $j$, $\|\,|L(D)|^*\|_g \ge \|\,|L(D_j)|^*\|_g \ge j$, a contradiction, so Theorem 2.20 is
proved.
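
As a concrete illustration of the norm $\|\cdot\|_g$ with $g(x) = \exp(x^2) - 1$, the following sketch computes $\|Z\|_g$ for a single standard normal variable $Z$. It assumes the standard closed form $E\exp(tZ^2) = (1 - 2t)^{-1/2}$ for $t < 1/2$; the function names are hypothetical and chosen only for this example.

```python
import math

def expected_g(c: float) -> float:
    """E g(|Z|/c) for Z ~ N(0, 1) and g(x) = exp(x^2) - 1.

    Uses E exp(t Z^2) = (1 - 2t)^(-1/2) for t < 1/2 with t = 1/c^2;
    the expectation is infinite when c^2 <= 2.
    """
    t = 1.0 / (c * c)
    if t >= 0.5:
        return math.inf
    return (1.0 - 2.0 * t) ** -0.5 - 1.0

def orlicz_norm_std_normal(tol: float = 1e-12) -> float:
    """Bisection for inf{c > 0 : E g(|Z|/c) <= 1}."""
    lo, hi = math.sqrt(2.0), 10.0  # expected_g is decreasing in c on (sqrt(2), inf)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_g(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

norm = orlicz_norm_std_normal()
print(norm, math.sqrt(8.0 / 3.0))
```

Solving $(1 - 2/c^2)^{-1/2} = 2$ by hand gives $c^2 = 8/3$, which the bisection reproduces numerically.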


Next, bounds for expected suprema will be developed, closely related to
the Sudakov–Chevet theorem 2.14. Let $M_n := \max(Z_1,\dots,Z_n)$ be the maximum
of $n$ i.i.d. standard normal variables $Z_i$. Then $EM_n$ is bounded below as
follows.


Lemma 2.21 For all $n \ge 1$, $EM_n \ge (\log n)^{1/2}/12$.


Remark. The constant 1/12 can be improved to $(\pi\log 2)^{-1/2}$, by a less elementary
proof (Fernique 1997, (1.7.1) and references given there).


Proof. For $n = 1$ we have $0 \ge 0$. For $n = 2$, $EM_2 \ge (\log 2)^{1/2}/12$ can be
found by a direct calculation (Problem 1). The following proof is for $n \ge 3$.
Let $a := (8\pi)^{-1/2}$. By Proposition 2.5, we have
$$P(Z_1 > (\log n)^{1/2}) \ge \frac{1}{2(2\pi\log n)^{1/2}}\,e^{-(\log n)/2} = a(n\log n)^{-1/2} \ge \frac{a}{n},$$
where $\log n \le n$ for all $n \ge 1$ since $x \le e^x$ for all $x$. Now, by its Taylor series,
$\log(1 - x) \le -x$ for $0 \le x < 1$, so
$$\Bigl(1 - \frac{a}{n}\Bigr)^{n} = \exp\Bigl(n\log\Bigl(1 - \frac{a}{n}\Bigr)\Bigr) \le \exp\Bigl(n\Bigl(-\frac{a}{n}\Bigr)\Bigr) = e^{-a}.$$
Since the $Z_i$ are i.i.d., $P(M_n \le (\log n)^{1/2}) \le (1 - a/n)^n$.
Then $P(M_n > (\log n)^{1/2}) \ge 1 - e^{-a} \ge 0.18$. Recall that for any real-valued
function $f$, we let $f^+ := \max(f, 0)$ and $f^- := -\min(f, 0)$. Thus $E(M_n^+) \ge 0.18(\log n)^{1/2}$.

Next, $-M_n^-$ is nondecreasing as $n$ increases, so for $n \ge 3$, $E(-M_n^-) \ge E(-M_3^-)$.
Also, $M_3^- = 0$ unless $Z_1$, $Z_2$ and $Z_3$ are all negative, which has
probability 1/8. We have $E(-M_3^- \mid M_3 < 0) \ge E(Z_1 \mid Z_1 < 0) = -(2/\pi)^{1/2}$.
Thus $E(-M_3^-) \ge -(2/\pi)^{1/2}/8 \ge -0.1$. It follows that $EM_n \ge (\log n)^{1/2}/12$
for all $n \ge 3$ and so for all $n \ge 1$.
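
The numerical constants in this proof can be checked directly. The sketch below is only an arithmetic check under the proof's own bounds (it does not compute $EM_n$ itself); the variable and function names are hypothetical.

```python
import math

A = (8.0 * math.pi) ** -0.5                 # the constant a := (8*pi)^(-1/2)
P_EXCEED = 1.0 - math.exp(-A)               # lower bound for P(M_n > (log n)^(1/2))
NEG_PART = math.sqrt(2.0 / math.pi) / 8.0   # upper bound for E(M_3^-)

def bound_from_proof(n: int) -> float:
    """0.18 (log n)^(1/2) - 0.1, the lower bound for E M_n obtained in the proof."""
    return 0.18 * math.sqrt(math.log(n)) - 0.1

def lemma_bound(n: int) -> float:
    """(log n)^(1/2) / 12, the bound stated in Lemma 2.21."""
    return math.sqrt(math.log(n)) / 12.0

# The proof's bound dominates the stated bound for every n checked from 3 upward.
dominates = all(bound_from_proof(n) >= lemma_bound(n) for n in range(3, 10000))
print(A, P_EXCEED, NEG_PART, dominates)
```

The comparison is tightest at $n = 3$, which is why the cases $n = 1, 2$ are handled separately in the proof.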


Here is a bound in terms of expectation and packing numbers, extending the 
more special preceding lemma. Part (b) is the form most often applied. 


Theorem 2.22 (Sudakov minoration) (a) For any countable subset $S$ of a
Hilbert space $H$ with its usual metric, and any $\varepsilon > 0$,
$$E\sup_{x\in S} L(x) \ge \frac{1}{17}\,\varepsilon\,(\log D(\varepsilon, S, d))^{1/2}.$$

(b) Let $X_t$, $t \in T$, be a Gaussian process with mean 0 defined on a countable
parameter space $T$. On $T$ take the pseudometric $d_X$ defined by the process,
$d_X(s,t) := [E((X_s - X_t)^2)]^{1/2}$. Then for any $\varepsilon > 0$,
$$E\sup_{t\in T} X_t \ge \frac{1}{17}\,\varepsilon\,(\log D(\varepsilon, T, d_X))^{1/2}.$$


Remark. The constant 1/17 can be improved to $(2\pi\log 2)^{-1/2}$ (Fernique 1997,
Theorem 4.1.4).


Proof. (a) Fix any $\varepsilon > 0$ and let $m := D(\varepsilon, S, d)$. Take $m$ points $x_1,\dots,x_m \in S$
with $\|x_i - x_j\|^2 > \varepsilon^2$ for $i \ne j$ and i.i.d. $N(0,1)$ variables $Z_i$. Let $V_i := \varepsilon Z_i/2^{1/2}$.
Then $E((V_i - V_j)^2) = \varepsilon^2 \le E((L(x_i) - L(x_j))^2)$ for all $i \ne j$, $i, j = 1,\dots,m$.
It follows then from Lemma 2.21 and the comparison theorem 2.16
that
$$E\sup_{x\in S} L(x) \ge E\max_{1\le i\le m} L(x_i) \ge E\max_{1\le i\le m} V_i \ge (\varepsilon/2^{1/2})\,EM_m \ge (\varepsilon/2^{1/2})(\log m)^{1/2}/12 \ge \varepsilon(\log D(\varepsilon, S, d))^{1/2}/17,$$
which finishes the proof of part (a).

(b): The map $t \mapsto X_t$ takes $T$ into a Hilbert space $H = L^2(\Omega, P)$, with
$\{L(X_t(\cdot))\}_{t\in T}$ equal in distribution to $\{X_t\}_{t\in T}$. So the result follows from part
(a).
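
Part (a) can be illustrated numerically on a finite set in $\mathbb{R}^k$, where a version of $L$ is $L(x) = (x, G)$ with $G$ standard normal. The sketch below is a Monte Carlo illustration, not a proof: the greedy routine only produces some $\varepsilon$-separated subset, whose size is a lower bound for $D(\varepsilon, S, d)$, and the test set (standard basis vectors), trial count, and seed are assumptions made for the example.

```python
import math
import random

def greedy_packing(points, eps):
    """Greedily pick points pairwise more than eps apart (Euclidean metric).

    The size of the result is a lower bound for the packing number D(eps, S, d).
    """
    chosen = []
    for p in points:
        if all(math.dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen

def sudakov_lower_bound(points, eps):
    """eps * (log m)^(1/2) / 17 with m the size of a greedy eps-separated subset."""
    m = len(greedy_packing(points, eps))
    return eps * math.sqrt(math.log(m)) / 17.0

def mc_expected_sup(points, trials=1000, seed=0):
    """Monte Carlo estimate of E max_x (x, G) with G standard normal."""
    rng = random.Random(seed)
    dim = len(points[0])
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += max(sum(a * b for a, b in zip(p, g)) for p in points)
    return total / trials

# Hypothetical test set: the 30 standard basis vectors of R^30 (pairwise distance sqrt(2)).
basis = [tuple(1.0 if i == j else 0.0 for j in range(30)) for i in range(30)]
bound = sudakov_lower_bound(basis, eps=1.0)
estimate = mc_expected_sup(basis)
print(bound, estimate)
```

For this set the minoration is far from sharp, which is consistent with the Remark that the constant 1/17 can be improved.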
2.6 Gaussian Measures and Convexity


There are several useful inequalities about normal measures and convex sets. 
Convex sets were treated in RAP, Sections 6.2 and 6.6. 

A set C in a vector space is called symmetric if —C := {—x: x € C} = C. 
A function f is called even if f(—x) = f(x) for all x. Thus, the indicator 
function of a set is even if and only if the set is symmetric. 

For sets A, B in a vector space and a constant c let cA := {ca: a € A} 
and A+B := {a+b: acA,beB}. 


Theorem 2.23 Let $C$ be a convex, symmetric set in $\mathbb{R}^k$. Let $f$ be a nonnegative,
even function in $L^1(\mathbb{R}^k, \mathcal{B}, V)$ where $V$ is Lebesgue measure $\lambda^k$ and $\mathcal{B}$ is the
Borel $\sigma$-algebra. Suppose that for every $t > 0$, $K_t := \{x: f(x) > t\}$ is convex.
Then for $0 \le \alpha \le 1$, any $y \in \mathbb{R}^k$, and $dx := dV(x)$,
$$\int_C f(x + \alpha y)\,dx \ge \int_C f(x + y)\,dx.$$


Proof. Since $f$ is integrable, both sides of the stated inequality are finite. First
suppose that $f$ is the indicator function of a set $K$. Then $K$ is convex and
symmetric. We need to prove
$$V(C \cap (K - \alpha y)) \ge V(C \cap (K - y)). \tag{2.22}$$

For these measures even to make sense, we need to take care of a measurability
problem. Not all convex sets are Borel sets: for example, in $\mathbb{R}^2$, the unit disk
together with an arbitrary subset of the boundary unit circle is always convex.
But we do have:


Lemma 2.24 Any convex set $D$ in $\mathbb{R}^k$ is Lebesgue measurable, and its boundary
$\partial D$ has measure 0, $V(\partial D) = 0$.


Proof. Either $D$ is included in some $(k-1)$-dimensional hyperplane, in which
case $V(D) = 0$ and we are done, or $D$ has a nonempty interior $U$ (RAP, Theorem
6.2.6). Then $D$ and $U$ have the same closure and boundary (RAP, Proposition
6.2.10; also, any point of $D$ is a vertex of a $k$-dimensional simplex included
in $D$). Let $p \in U$. By translation we can assume $p = 0$. A straight line $L$ through
0 can intersect $\partial U$ in at most two points: suppose $0, q, r$ are on $L$ in that order,
$q \ne r$, $q, r \in \partial U$. Now $D$ includes a neighborhood $W$ of 0, $W = \{x: |x| < \delta\}$
for some $\delta > 0$. Let $\lambda := |q|/|r|$. Take points $r_n \in D$ with $r_n \to r$. Then the
convex combinations $\lambda r_n + (1 - \lambda)w$, $w \in W$, yield all points in a neighborhood
of $q$, so $q \in U$, a contradiction. Then by spherical coordinates (RAP,
Section 4.4, problems 8, 9), $V(\partial D) = 0$ so $D$ is Lebesgue measurable.


Now continuing the proof of Theorem 2.23, let $\lambda := (1 + \alpha)/2$, so that
$\lambda(-y) + (1 - \lambda)y = -\alpha y$, and $1/2 \le \lambda \le 1$. Then $K - \alpha y = \lambda(K - y) + (1 - \lambda)(K + y)$
since $K$ is convex, so
$$C \cap (K - \alpha y) \supset \lambda(C \cap (K - y)) + (1 - \lambda)(C \cap (K + y))$$
because $C$ is also convex. Thus
$$V(C \cap (K - \alpha y)) \ge V\{\lambda(C \cap (K - y)) + (1 - \lambda)(C \cap (K + y))\}.$$

Since both $C$ and $K$ are symmetric, $C \cap (K - y) = -\{C \cap (K + y)\}$
and $V(C \cap (K - y)) = V(C \cap (K + y))$. If $C$ and $K$ are compact, then the
Brunn–Minkowski inequality (RAP, 6.6.1(b)), with $A = C \cap (K - y)$, $B = C \cap (K + y)$
and still $\lambda = (1 + \alpha)/2$ gives (2.22) as desired.

Recall that if $U$ is any open set in a metric space and $r > 0$, with $B(x, r) := \{y: d(x, y) < r\}$,
then $U_r := \{x \in U: B(x, r) \subset U\}$ is always a closed subset
of $U$. It is easily seen that if $U$ is convex and/or symmetric, so is $U_r$ for any
$r > 0$. Thus any open (symmetric) convex set is the union of an increasing
sequence of compact (symmetric) convex sets $K_n := \{x \in U_{1/n}: |x| \le n\}$.
Thus for any convex set $L$ in $\mathbb{R}^k$, by Lemma 2.24 there exist compact convex sets
$L(n)$ such that $1_{L(n)} \uparrow 1_L$ almost everywhere for Lebesgue measure. Applying
this to the different convex sets $L = A, B$ shows that (2.22) holds for any convex
symmetric sets $C$ and $K$ with $V(K) < \infty$, so that $V(K) = 0$ or $K$ is bounded.

Next suppose $f$ is bounded. Then we can assume $0 \le f \le 1$. Take simple
functions approaching $f$, specifically
$$f_n := \sum_{k=0}^{n-1}\frac{k}{n}\,1_{\{k/n < f \le (k+1)/n\}} = \frac{1}{n}\sum_{j=1}^{n-1} 1_{\{f > j/n\}},$$
since $k/n < f \le (k+1)/n$ if and only if $f > j/n$ for exactly $k$ values of
$j \ge 1$. So by linearity and (2.22), the conclusion holds for $f_n$ for each $n$.
By dominated convergence, it then holds for any bounded $f$. Then if $f$ is
unbounded, let $g_n := \min(f, n)$. Then the result holds for $g_n$ for each $n$ and
$g_n \uparrow f$, so by monotone convergence it holds for $f$.
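
Theorem 2.23 can be checked numerically in the simplest case $k = 1$. The sketch below assumes $f$ is the standard normal density (nonnegative, even, with intervals as level sets) and $C = [-c, c]$, so that both integrals reduce to differences of normal distribution function values; the particular values of $c$ and $y$ are arbitrary choices for illustration.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Distribution function of N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def shifted_mass(c: float, shift: float) -> float:
    """Integral over C = [-c, c] of f(x + shift) dx, with f the N(0, 1) density."""
    return std_normal_cdf(c + shift) - std_normal_cdf(-c + shift)

c, y = 1.0, 2.5
alphas = [k / 10.0 for k in range(11)]          # alpha = 0, 0.1, ..., 1
masses = [shifted_mass(c, a * y) for a in alphas]
print(masses[0], masses[-1])
```

In this example the integral is in fact nonincreasing in $\alpha$, so each value with $\alpha < 1$ is at least the value at $\alpha = 1$, as the theorem asserts.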


Theorem 2.25 Let $X$ be a random variable with values in $\mathbb{R}^k$ whose law has
a density $f$ satisfying the hypotheses of Theorem 2.23. Let $Y$ be any other
$\mathbb{R}^k$-valued random variable independent of $X$. Then for any convex, symmetric
set $C$ and $0 \le \alpha \le 1$,
$$\Pr\{X + \alpha Y \in C\} \ge \Pr\{X + Y \in C\}.$$

Proof. For $P := \mathcal{L}(Y)$, by Theorem 2.23, with $-y$ in place of $y$,
$$\Pr\{X + \alpha Y \in C\} = \int\!\!\int_{C - \alpha y} f(x)\,dx\,dP(y) = \int\!\!\int_{C} f(z - \alpha y)\,dz\,dP(y) \ge \int\!\!\int_{C} f(z - y)\,dz\,dP(y) = \Pr(X + Y \in C).$$


Corollary 2.26 Let $Z$ and $X$ be random variables with values in $\mathbb{R}^k$, $\mathcal{L}(Z) = N(0, A)$
and $\mathcal{L}(X) = N(0, D)$, where $D$ and $A - D$ are nonnegative definite
and symmetric. Let $C$ be a convex symmetric set in $\mathbb{R}^k$. Then
$$\Pr(X \in C) \ge \Pr(Z \in C).$$


Proof. Let $Y$ be independent of $X$ with $\mathcal{L}(Y) = N(0, A - D)$. Then $\mathcal{L}(X + Y) = \mathcal{L}(Z)$,
and Theorem 2.25 for $\alpha = 0$ gives the result.


Corollary 2.27 Let $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ both be jointly Gaussian with
mean 0 and such that $\{EY_iY_j - EX_iX_j\}_{1\le i,j\le n}$ is nonnegative definite. Then
for any $M$,
$$\Pr\{\max_{1\le j\le n} |X_j| > M\} \le \Pr\{\max_{1\le j\le n} |Y_j| > M\}.$$


Proof. Corollary 2.26 applies, taking $C := \{t \in \mathbb{R}^n: \max_j |t_j| \le M\}$, a convex,
symmetric set, then taking complements.
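
A special case of Corollary 2.27 that can be computed in closed form is that of independent coordinates, where the difference of covariance matrices is diagonal. The sketch below assumes independent $X_j \sim N(0, \sigma_j^2)$ and $Y_j \sim N(0, \tau_j^2)$ with $\tau_j \ge \sigma_j$; the numerical standard deviations are hypothetical.

```python
import math

def prob_max_abs_exceeds(sigmas, M: float) -> float:
    """Pr{max_j |X_j| > M} for independent X_j ~ N(0, sigma_j^2), M > 0."""
    inside = 1.0
    for s in sigmas:
        # Pr{|X_j| <= M} = erf(M / (sigma_j * sqrt(2)))
        inside *= math.erf(M / (s * math.sqrt(2.0)))
    return 1.0 - inside

sig_x = [0.5, 1.0, 1.5]   # standard deviations of X_1, X_2, X_3
sig_y = [1.0, 1.0, 2.0]   # each at least the matching entry of sig_x
levels = [0.1, 0.5, 1.0, 2.0, 5.0]
pairs = [(prob_max_abs_exceeds(sig_x, M), prob_max_abs_exceeds(sig_y, M)) for M in levels]
print(pairs)
```

Each pair satisfies the inequality of the corollary, here simply because each factor $\operatorname{erf}(M/(\sigma\sqrt{2}))$ decreases as $\sigma$ increases.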


For any set $C$ in a vector space $V$, and a vector space $W$ of real linear
forms on $V$, the polar of $C$ is defined by $C^{*1} := \{w \in W: w(x) \le 1$ for all
$x \in C\}$. If $C$ is symmetric in $V$ then $C^{*1}$ is also symmetric in $W$ and equals
$\{w \in W: |w(x)| \le 1$ for all $x \in C\}$.

For $V = \mathbb{R}^k$, $W$ will be understood to be $\mathbb{R}^k$ also, defining linear forms via
the usual inner product. Recall the standard normal law $N(0, I)$ on $\mathbb{R}^k$. For a
set $C$ in a vector space $V$ and a linear transformation $A$ defined on $V$, $AC$ will
mean $\{Ax: x \in C\}$.


Corollary 2.28 Let $A$ be a linear transformation from $\mathbb{R}^k$ into itself with norm
$\|A\| := \sup\{\|Ax\|: \|x\| \le 1\} \le 1$, for the usual Euclidean norm $\|\cdot\|$ on $\mathbb{R}^k$.
Then for any symmetric subset $C$ of $\mathbb{R}^k$, $N(0, I)((AC)^{*1}) \ge N(0, I)(C^{*1})$.


Proof. If $\mathcal{L}(G) = N(0, I)$, then $G \in (AC)^{*1}$ means $(G, Ax) \le 1$ for all
$x \in C$, for the usual inner product, or equivalently $(A^tG, x) \le 1$ for all
$x \in C$. Now $\mathcal{L}(A^tG) = N(0, A^tA)$, where $I - A^tA$ is nonnegative definite,
and $N(0, I)((AC)^{*1}) = N(0, A^tA)(C^{*1})$. Since $C^{*1}$ is a convex, symmetric
set, Corollary 2.26 applies and gives the result as stated.


The next fact is closely related: 


Lemma 2.29 Let $B$ be a linear operator from $H$ into itself with $\|B\| \le 1$ and
$C$ a (nonempty) subset of $H$. Then for any $t \ge 0$,
$$\Pr\{|L(BC)|^* \le t\} \ge \Pr\{|L(C)|^* \le t\}.$$


Proof. If $t = 0$, then the right side is 0 unless $C = \{0\}$, in which case both sides
of the inequality equal 1 and it holds. So assume $t > 0$. First suppose $C$ is finite.
We have $|L(C)|^* \le t$ if and only if $|L(C/t)|^* \le 1$. Let $V$ be the linear span of
$C$, a finite-dimensional Hilbert space. A version of $L$ restricted to $V$ is given
by $L(v) = (v, w)$ for the given inner product where $w$ has distribution $N(0, I)$
on $V$. Since the joint distribution of $L(x)$ for $x \in C$ is uniquely determined,
Corollary 2.28 for the set $\{\pm x/t: x \in C\}$ gives the result. In general, let finite
sets $F_n$ increase up to a countable dense set in $C$. Then $|L(C)|^* \le t$ if and only
if $|L(F_n)|^* \le t$ for all $n$ (except possibly on a set of 0 probability) so
$$\Pr\{|L(C)|^* \le t\} = \lim_{n\to\infty}\Pr\{|L(F_n)|^* \le t\} \le \lim_{n\to\infty}\Pr\{|L(BF_n)|^* \le t\} = \Pr\{|L(BC)|^* \le t\}.$$


2.7 Sample Continuity 


A function $f$ from a set $C$ in a vector space $V$ into $\mathbb{R}$ will be called prelinear
iff for any $c_1,\dots,c_n \in C$ and $a_1,\dots,a_n \in \mathbb{R}$ such that $a_1c_1 + \cdots + a_nc_n = 0$,
we have $a_1f(c_1) + \cdots + a_nf(c_n) = 0$.


A GB-set must be totally bounded (by the Sudakov—Chevet theorem 2.14). 
Since a uniformly continuous function on a totally bounded set must be 
bounded, every GC-set is a GB-set. 


Lemma 2.30 For any prelinear function $f$ on a set $C$ in a real vector space $V$
into $\mathbb{R}$, let
$$g(x_1c_1 + \cdots + x_nc_n) := x_1f(c_1) + \cdots + x_nf(c_n)$$
for any $x_1,\dots,x_n \in \mathbb{R}$ and $c_1,\dots,c_n \in C$. Then $g$ is a well-defined linear
function from the linear span of $C$ into $\mathbb{R}$ which extends $f$. Such an extension
exists if and only if $f$ is prelinear.


Proof. To show that $g$ is well-defined, suppose $x_1c_1 + \cdots + x_nc_n = y_1d_1 + \cdots + y_md_m$
for some $c_1,\dots,c_n, d_1,\dots,d_m \in C$ and $x_1,\dots,x_n, y_1,\dots,y_m \in \mathbb{R}$.
Then $x_1c_1 + \cdots + x_nc_n - y_1d_1 - \cdots - y_md_m = 0$, so since $f$ is prelinear,
$x_1f(c_1) + \cdots + x_nf(c_n) - y_1f(d_1) - \cdots - y_mf(d_m) = 0$, so the apparent
candidates to be defined as $g(x_1c_1 + \cdots + x_nc_n)$ are equal, and $g$ is well-defined.
Then $g$ is clearly linear and extends $f$. On the other hand if $f$ is not prelinear,
suppose $x_1c_1 + \cdots + x_nc_n = 0$ but $x_1f(c_1) + \cdots + x_nf(c_n) \ne 0$. Then
$x_1c_1 = -x_2c_2 - \cdots - x_nc_n$ but $x_1f(c_1) \ne -x_2f(c_2) - \cdots - x_nf(c_n)$, so $f$
has no linear extension to the linear span of $C$.


Now, a finite-dimensional projection (fdp) will be an orthogonal projection
(RAP, end of Section 5.3) of $H$ onto a finite-dimensional subspace of $H$. For
a sequence $\{\pi_n\}$ of such projections, $\pi_n \uparrow I$ will mean that the range of $\pi_n$ is
included in that of $\pi_{n+1}$ for all $n$ and that the union of all the ranges is dense
in $H$. Since $\pi_n x$ is the nearest point to $x$ in the range of $\pi_n$ (RAP, Theorems
5.3.6 and 5.3.8), it follows that $\|\pi_n x - x\| \to 0$ as $n \to \infty$ for all $x \in H$. For
any orthogonal projection $\pi$, let $\pi^\perp := I - \pi$, the orthogonal projection onto
the orthogonal complement of the range of $\pi$ (RAP, 5.3.8).


Lemma 2.31 Whenever fdp's $\pi_n \uparrow I$, there is an orthonormal basis of $H$ which
includes an orthonormal basis of the range of $\pi_n$ for each $n$.


Proof. Begin with an orthonormal basis of the range of $\pi_1$, then recursively,
given an orthonormal basis of the range of $\pi_n$, adjoin an orthonormal
basis of the range of $\pi_{n+1} - \pi_n$. For each $n$, $\pi_{n+1}\circ\pi_n = \pi_n = \pi_n\circ\pi_{n+1}$,
so $\pi_n^\perp\pi_{n+1} = \pi_{n+1}\pi_n^\perp = \pi_{n+1} - \pi_n$. Also, $\pi_{n+1} = \pi_{n+1}\circ(I - \pi_n + \pi_n) = (\pi_{n+1} - \pi_n) + \pi_n$,
so the range of $\pi_{n+1}$ is the direct sum of two orthogonal
subspaces, $\operatorname{ran}(\pi_{n+1} - \pi_n) \perp \operatorname{ran}(\pi_n)$. Taking the union of the bases over $n$
gives an orthonormal set whose linear span is dense in $H$, and thus is a basis
(RAP, Theorem 5.4.9).
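
In finite dimensions the recursive construction in this proof amounts to Gram-Schmidt orthogonalization along an increasing chain of subspaces. The following sketch, with hypothetical spanning vectors in $\mathbb{R}^4$ standing in for the ranges of $\pi_1, \pi_2, \pi_3$, extends an orthonormal basis one range at a time.

```python
import math

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def extend_orthonormal(basis, vectors, tol=1e-12):
    """Extend an orthonormal list by the new directions found in `vectors`."""
    basis = list(basis)
    for v in vectors:
        w = list(v)
        for e in basis:
            coef = dot(w, e)                       # component along e
            w = [wi - coef * ei for wi, ei in zip(w, e)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:                             # v adds a new direction
            basis.append([wi / norm for wi in w])
    return basis

# Hypothetical nested ranges: span{v1}, span{v1, v2}, span{v1, v2, v3}.
v1, v2, v3 = [1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0]
basis, dims = [], []
for spanning in ([v1], [v1, v2], [v1, v2, v3]):
    basis = extend_orthonormal(basis, spanning)
    dims.append(len(basis))
print(dims)
```

After each stage the first `dims[n]` vectors form an orthonormal basis of the corresponding range, mirroring the statement of the lemma.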


If $C$ is a totally bounded set in a metric space, then the set $V_1$ of all uniformly
continuous real-valued functions on $C$ is a vector space. For any real function $f$
on $C$ let $\|f\|_C := \sup\{|f(x)|: x \in C\}$. Then $V_1$ with norm $\|\cdot\|_C$ is naturally
isometric to the space $C(K)$ of all continuous functions on the completion $K$
of $C$, where $K$ is compact. Since $C(K)$ is separable for the supremum norm
(RAP, Corollary 11.2.5), $V_1$ is separable.

Let $C \subset H$ and let $V_2$ be the set of prelinear elements of $V_1$. Each element $h$
of $H$ defines a function on $C$ by $x \mapsto (x, h)$, $x \in C$. Let $H_C$ be the completion
of $H$ for $\|\cdot\|_C$. Note that each element of $H_C$ naturally defines a uniformly
continuous, prelinear function on $C$ as a uniform limit of uniformly continuous,
prelinear functions. Let $V_3$ be the set of functions on $C$ so defined. Then
$V_3 \subset V_2$. (Often, $V_3 = V_2$, but whether $V_3 = V_2$ in all cases will not be settled
here.)

Let $V$ be a set of functions on $C$. Say that $L$ on $C$ can be realized on
$V$ if there is a probability measure $\mu$ on $V$ such that the process $(v, x) \mapsto v(x)$,
$v \in V$, $x \in C$, has the joint distributions of $L$ restricted to $C$: for any
$x_1,\dots,x_n$ in $C$, $v \mapsto v(x_j)$ are jointly Gaussian with mean 0 and covariances
$(x_i, x_j)$, $i, j = 1,\dots,n$.

From the definition, $\mu$ would be defined on the smallest $\sigma$-algebra $\mathcal{B}_C$
making all evaluations $v \mapsto v(x)$ measurable for $x \in C$. If $D$ is a countable
dense set in $C$, $V$ is a set of continuous functions on $C$ and $v, w \in V$, then
$$\|v - w\|_C := \sup\{|(v - w)(y)|: y \in C\} = \sup\{|(v - w)(y)|: y \in D\},$$
so $v \mapsto \|v - w\|_C$ is $\mathcal{B}_C$ measurable for $w$ fixed. If also $V$ is a set of bounded
functions on $C$, separable for $\|\cdot\|_C$, as $V_1$, $V_2$ and $V_3$ are, then all open sets
for the $\|\cdot\|_C$ topology are in $\mathcal{B}_C$ (RAP, Proposition 2.1.4 and its proof), so $\mathcal{B}_C$
equals the Borel $\sigma$-algebra.

Given a set $A$ in a vector space, the symmetric convex hull of $A$ is the smallest
convex set including $A$ and $-A = \{-x: x \in A\}$, and is the set of all finite
convex combinations $\sum_{i=1}^{n}\lambda_i a_i$, $a_i \in A \cup -A$, with $\lambda_i \ge 0$ and $\sum_{i=1}^{n}\lambda_i = 1$,
for all positive integers $n$. The closed symmetric convex hull of $A$ for some
topology (in this case the Hilbert norm) is the closure of the symmetric convex
hull. Here is a set of characterizations of GC-sets:


Theorem 2.32 The following are equivalent for a totally bounded set $C$ in $H$:

(a) $C$ is a GC-set;
(a') The closed, symmetric convex hull of $C$ is a GC-set;
(b) For any $\varepsilon > 0$, $\Pr(|L(C)|^* < \varepsilon) > 0$;
(c) There exist fdp's $\pi_n \uparrow I$ such that $\liminf_{n\to\infty} |L(\pi_n^\perp C)|^* = 0$ a.s.;
(d) For some sequence $\pi_n \uparrow I$ of fdp's, $|L(\pi_n^\perp C)|^* \to 0$ in probability;
(d') For some sequence $\pi_n \uparrow I$ of fdp's, $|L(\pi_n^\perp C)|^* \to 0$ almost surely;
(e) For every sequence $\pi_n \uparrow I$ of fdp's, $|L(\pi_n^\perp C)|^* \to 0$ in probability;
(e') For every sequence $\pi_n \uparrow I$ of fdp's, $|L(\pi_n^\perp C)|^* \to 0$ almost surely;
(f) $L$ can be realized on the completion $V_3$ of $H$ for $\|\cdot\|_C$;
(g) $L$ on $C$ can be realized on the space $V_2$ of uniformly continuous, prelinear functions;
(h) $L$ on $C$ can be realized on the space $V_1$ of uniformly continuous functions.


Proof. For any $f, g \in H$ and constant $c$, $L(cf + g) = cL(f) + L(g)$ a.s.,
since $L(cf + g) - cL(f) - L(g)$ has mean and variance both 0.

For any fdp $\pi$, the processes $L\circ\pi$ and $L\circ\pi^\perp$, when restricted to a countable
set, are independent and satisfy $L = L\circ\pi + L\circ\pi^\perp$ a.s.

Let us first consider the properties (d), (d'), (e), (e'). Clearly (e') implies (e)
which implies (d).

To show (d) implies (d'), let $Y$ be a countable dense subset of $C$. For
$k = 1, 2, \dots$, let $Y^k := \{(\pi_j - \pi_k)y: y \in Y, j \ge k\}$. Then $Y^k$ is countable and
dense in $\pi_k^\perp C$. By Lemma 2.4, $|L(\pi_k^\perp C)|^* = \sup_{x\in Y^k} |L(x)|$ for all $k$ a.s. For
$\varepsilon > 0$, choose $k = k(\varepsilon)$ sufficiently large so that
$$\Pr\Bigl\{\sup_{j\ge k}\sup_{y\in Y} |L((\pi_j - \pi_k)y)| > \varepsilon\Bigr\} = \Pr\Bigl\{\sup_{x\in Y^k} |L(x)| > \varepsilon\Bigr\} = \Pr\{|L(\pi_k^\perp C)|^* > \varepsilon\} < \varepsilon.$$
Then
$$\Pr\Bigl\{\sup_{i,j\ge k}\sup_{y\in Y} |L((\pi_i - \pi_j)y)| > 2\varepsilon\Bigr\} < 2\varepsilon.$$
So $\Pr\{\sup_{i,j\ge k}\sup_{y\in Y} |L((\pi_i - \pi_j)y)| > 2\varepsilon\} \downarrow 0$ as $k \to \infty$. Thus,
$$\sup_{i,j\ge k}\sup_{y\in Y} |L((\pi_i - \pi_j)y)| \to 0$$
a.s. as $k \to \infty$. Since
$$|L(\pi_k^\perp C)|^* \le \sup_{i,j\ge k}\sup_{y\in Y} |L((\pi_i - \pi_j)y)|$$
a.s., it follows that $|L(\pi_k^\perp C)|^* \to 0$ a.s., i.e. (d') holds.

To show (d') implies (e'), let $Q_m$ be another sequence of fdp's with $Q_m \uparrow I$.
Then for $k$ fixed as above, let $e_1,\dots,e_r$ be a basis for the range of $\pi_k$. Then
$Q_m^\perp e_j \to 0$ in $H$ for each $j$. Since $r$ is fixed and $C$ is bounded, $|L(Q_m^\perp\pi_k C)|^* \to 0$
in probability. We also have $\Pr\{|L(Q_m^\perp\pi_k^\perp C)|^* > \varepsilon\} \le \Pr\{|L(\pi_k^\perp C)|^* > \varepsilon\}$
by Lemma 2.29. The latter is $< \varepsilon$ by choice of $k$. Since $|L(Q_m^\perp C)|^* \le |L(Q_m^\perp\pi_k C)|^* + |L(Q_m^\perp\pi_k^\perp C)|^*$
a.s., we have $|L(Q_m^\perp C)|^* \to 0$ in probability
as $m \to \infty$. By the last paragraph, the convergence is almost sure. So the
properties (d), (d'), (e), and (e') are equivalent.

These properties clearly imply (c). To see that they imply (f), note that in
the above proof, each $M_n(\cdot)(\omega)$ is the inner product with some element of $H$,
and $M_n$ almost surely converges uniformly on $C$ to $M$. Each $M_n$ defines a
measurable function from $\Omega$ into $H$, thus into $V_3$. Hence $M$ defines a random
variable with values in $V_3$ (RAP, Theorem 4.2.2). So (f) follows.

Clearly (f) implies (g) implies (h) implies (a).

On the other hand, (d) implies that the $M_n$ converge uniformly also
on the closed, symmetric convex hull $\operatorname{sco}(C)$ of $C$. In fact, for any fdp
$\pi$, $|L(\pi^\perp\operatorname{sco}(C))|^* = |L(\pi^\perp C)|^*$ a.s., since $|L(-x)| = |L(x)|$ a.s. for any
$x$ and finite convex combinations with rational coefficients of elements of
$Y \cup -Y$ give a countable dense set in $\operatorname{sco}(C)$. The limit of the $M_n$ is again $M$,
a version of $L$, now on $\operatorname{sco}(C)$. So (d) implies (a'), which implies (a) clearly.

Next, to see that (d) implies (b), given $\varepsilon > 0$, take a fdp $\pi$ such that
$\Pr\{|L(\pi^\perp C)|^* > \varepsilon/2\} < 1/2$. Also $\Pr\{|L(\pi C)|^* < \varepsilon/2\} > 0$, since $\pi C$ is
a bounded set in a finite-dimensional space, and since $L\circ\pi$ and $L\circ\pi^\perp$ can
be taken to be independent (on $Y$), we get $\Pr\{|L(C)|^* < \varepsilon\} > 0$, proving (b).

Next it will be shown that (b) implies (c). For $\varepsilon > 0$, and fdp's $\pi_n \uparrow I$,
Lemma 2.29 implies $\Pr\{|L(\pi_n^\perp C)|^* < \varepsilon\} \ge \Pr\{|L(C)|^* < \varepsilon\} \ge \delta$ for some
$\delta > 0$ for all $n$. The event $D$ that $|L(\pi_n^\perp C)|^* < \varepsilon$ for infinitely many $n$, that is,
$\bigcap_{m\ge 1}\bigcup_{n\ge m}\{|L(\pi_n^\perp C)|^* < \varepsilon\}$, thus has probability at least $\delta$. But $D$ is a "tail
event," since it depends on the sequence of independent random variables $L(e_j)$
only for $j \ge k$ for $k$ arbitrarily large. It follows that $D$ has probability 0 or 1
(Kolmogorov's zero-one law, RAP, 8.4.4), thus probability 1. This yields (c).

Next (c) implies (a): for any $\varepsilon > 0$, suppose that $|L(\pi_n^\perp C)|^* < \varepsilon/2$ for
some $\omega$ and $n$. Then $M_n$, being linear on the finite-dimensional bounded set
$\pi_n C$, is uniformly continuous there, so for some $\gamma > 0$, $\|x - y\| < \gamma$ implies
$|M_n(x) - M_n(y)| < \varepsilon/2$, for $x, y \in \pi_n(Y)$ and thus for $x$ and $y$ in $Y$, and then
since $M_n + L\circ\pi_n^\perp = L$ almost surely on $Y$, $|L(x) - L(y)| < \varepsilon$. Thus $L(\cdot)(\omega)$
is almost surely uniformly continuous on $Y$, hence again extends to a uniformly
continuous function on $C$ which is a version of $L$, giving (a).

It will now be enough to prove that (a) implies (d). Given $\varepsilon > 0$, take a
version of $L$ and $\delta > 0$ such that
$$\Pr\{\sup\{|L(x) - L(y)|: x, y \in Y, \|x - y\| < \delta\} > \varepsilon\} < \varepsilon.$$

Take a finite-dimensional linear subspace $F$ of $H$ such that $F \cap C$ is within
$\delta/2$ of every point of $C$. We can assume that $Y \cap F$ is dense in $F \cap C$. Let
$\pi$ be the orthogonal projection onto $F$. Since $Y$ is countable, we have $L(x - y) = L(x) - L(y)$
and $L(\pi^\perp(x - y)) = L(\pi^\perp x) - L(\pi^\perp y)$ almost surely for
all $x, y \in Y$. Then by Lemma 2.29,
$$\varepsilon > \Pr\{\sup\{|L(\pi^\perp x) - L(\pi^\perp y)|: x, y \in Y, \|x - y\| < \delta\} > \varepsilon\} \ge \Pr\{\sup\{|L(\pi^\perp x)|: x \in Y\} > \varepsilon\}$$
since for any $x \in Y$ there is a $y$ in $F \cap Y$ with $\|x - y\| < \delta$ and $\pi^\perp y = 0$.
Letting $\varepsilon = 1/n \downarrow 0$, $n \to \infty$, (d) holds.


Recall that a Borel probability measure $\mu$ on a separable Banach space $B$ is
called Gaussian if every continuous linear form in $B'$ has a Gaussian distribution.
It follows that the norm $\|\cdot\|$ on $B$ satisfies some inequalities on the
upper tail of its distribution for $\mu$ (Landau–Shepp–Marcus–Fernique bounds,
Theorem 2.6). In particular, $\int \|x\|^2\,d\mu(x) < \infty$.


Theorem 2.33 Let $(B, \|\cdot\|)$ be a separable Banach space. Let $\mu$ be a Gaussian
probability measure with mean 0 on the Borel sets of $B$. Then the unit ball
$B'_1 := \{f: \|f\|' \le 1\}$ in the dual Banach space $B'$ is a compact GC-set in
$L^2(B, \mu)$.


Proof. A Cauchy sequence $\{y_n\}$ in $B'_1$ for the $L^2(\mu)$ norm converges in
$L^2(\mu)$. Consider the weak-star topology on $B'$, in other words the topology of
pointwise convergence on $B$. The functions in $B'_1$ are uniformly equicontinuous,
indeed Lipschitz with the uniform bound $|f(x) - f(y)| \le \|x - y\|$, $f \in B'_1$,
$x, y \in B$. Thus, in $B'_1$, pointwise convergence on $B$ is equivalent to convergence
on a countable dense set, so the weak-star topology on $B'_1$ is metrizable
(cf. RAP, Theorem 2.4.4).

Any linear function on $B$ is given by its values on $B_1 := \{x \in B: \|x\| \le 1\}$,
and pointwise convergence on $B$ is equivalent to pointwise convergence on $B_1$.
The set of all functions from $B_1$ into $[-1, 1]$, with the topology of pointwise
convergence, is compact by Tychonoff's theorem (RAP, 2.11). Clearly $B'_1$ is
a closed subset of this compact space, so it is also compact. (Compactness of
$B'_1$ in the weak* topology for a general Banach space $B$ is known as Alaoglu's
theorem, e.g. Dunford and Schwartz 1958, pp. 424–426.)

So $\{y_n\}$ has a subsequence converging pointwise on $B$ to some element $y$
of $B'_1$. For jointly Gaussian variables, pointwise convergence (convergence in
probability) implies $L^2$ convergence, so $y$ is the $L^2$ limit of $\{y_n\}$ and $B'_1$ is
compact in $L^2(\mu)$.

The natural mapping $T$ of $B'$ into $L^2(\mu)$ has an adjoint $T^*$ taking $L^2(\mu)$ into
$B''$, the dual space of $(B', \|\cdot\|')$. There is a natural map of $B$ into $B''$ given
by $x \mapsto (h \mapsto h(x))$ for $x \in B$ and $h \in B'$. The map is an isometry (RAP,
Corollary 6.1.5, of the Hahn–Banach theorem 6.1.4). So $B$ can be viewed as a
linear subspace of $B''$. If it is all of $B''$, then $B$ is called reflexive. In the present
case, whether or not $B$ is reflexive, $T^*$ actually has values in $B$:


Lemma 2.34 Let $B$ be a separable Banach space and $\mu$ a measure on $B$
such that $\int \|x\|^2\,d\mu(x) < \infty$. Then for the natural mapping $T$ of $B'$ into $H := L^2(\mu)$,
the adjoint $T^*$ on $H'$ has values in $B$.


Proof. For any $h \in H$ and $y \in B'$,
$$(T^*h)(y) = (h, Ty) = \int_B h(x)\,y(x)\,d\mu(x) = y(u)$$
where $u \in B$ is defined by the Bochner integral $u = \int_B h(x)\,x\,d\mu(x)$ (Appendix
E, Theorem E.7). The linear form $y$ can be taken under the integral sign since the
Bochner integral, when it exists, equals the Pettis integral (Appendix E).


So let $J$ be the range of $T^*$, a linear subspace of $B$, and $S$ its closure, a
Banach subspace of $B$. Note that each element of $S$ is uniformly continuous on
$B'_1$ for the $H = L^2(\mu)$ norm topology, since it is a limit in the norm $\|\cdot\|$ on $B$,
and thus uniformly on $B'_1$, of such functions. It will be shown that $\mu(S) = 1$.
If $S \ne B$, take a countable dense subset $\{x_m\}$ of $B\setminus S$. By the Hahn–Banach
theorem (RAP, 6.1.4) for each $m = 1, 2, \dots$, there is a $u_m \in B'_1$ such that
$u_m = 0$ on $S$ and $u_m(x_m) = d(x_m, S) := \inf\{\|x_m - y\|: y \in S\}$.

For any $x \in B\setminus S$, let $\varepsilon := d(x, S) > 0$ and take $m$ with $\|x - x_m\| < \varepsilon/2$.
Then $|u_m(x)| > d(x_m, S) - \varepsilon/2 > \varepsilon - \varepsilon/2 - \varepsilon/2 = 0$, so $u_m(x) \ne 0$. So $S = \bigcap_{m=1}^{\infty} u_m^{-1}\{0\}$.
For each $m$, to show that $\mu(u_m = 0) = 1$ is equivalent to showing
that $T(u_m) = 0$. If not, let $T(u_m) = v \ne 0$. Then $0 < (v, v) = (T(u_m), v) = u_m(T^*v) = 0$
since $T^*v \in S$, a contradiction. So $\mu(u_m = 0) = 1$ for each of
the countably many values of $m$. It follows that $\mu(S) = 1$.

Let $K$ be the closure of the range of $T$ in $H$. Then $K$ is a Hilbert space. A
limit of Gaussian random variables with mean 0 in $L^2(\mu)$ is also such a random
variable, so $K$ consists of such random variables, and any finite set of them has
a joint normal distribution. Thus the identity from $K$ to itself is an isonormal
process $L$. For this $L$, we can apply Theorem 2.32, where $S$ is the space $V_3$ of
Theorem 2.32(f). It follows that $B'_1$ is a GC-set.

The next fact is a direct consequence of Theorem 2.32:

Corollary 2.35 For any two GC-sets $C$, $D$, their union $C \cup D$ is also a GC-set.


Proof. Condition (e) or (e’) in Theorem 2.32 holds on C and on D and so, 
clearly, on C U D. 


2.8 A Metric Entropy Condition Implying Sample Continuity 


Recall that a stochastic process X;(@), t € T, is said to be sample-bounded 
on T if sup,er X; is finite for almost all w. If T is a topological space, then 
the process is said to be sample-continuous if for almost all w, t > X,(@) 
is continuous. The isonormal process is not sample-continuous on the Hilbert 
space H: let {e,} be an orthonormal sequence. Then L(e,) are i.i.d. N(O, 1) 
variables. Thus if a, —> 0 slowly enough, specifically if a,(log n)! — oo as 
n —> œ, L(anen) are almost surely unbounded (by Theorem 2.14). So not all 
bounded sets or even compact sets are GB-sets or GC-sets. Such sets must be 
small enough in a metric entropy sense. This section will prove a sufficient 
condition based on metric entropy (defined in Section 1.2), while Section 2.10 
will give a characterization based on what is called generic chaining. 

A metric entropy sufficient condition for sample continuity of $L$ will actually 
give a quantitative bound for the continuity. Let $(T, d)$ be a metric space, or, 
more generally, let $d$ be a pseudometric on $T$. A function $J$ will be called a 
sample modulus for a real stochastic process $\{X_t, t \in T\}$ iff there is a process 
$Y_t$ with the same laws as $X_t$ and such that for almost all $\omega$ there is an $M(\omega) < \infty$ 
such that for all $s, t \in T$, $|Y_s - Y_t|(\omega) \le M(\omega) J(d(s, t))$. 

Whenever $J$ is a sample modulus for $L$ on $C \subset H$, and $\{X_t, t \in T\}$ is 
a Gaussian process with mean 0 and $\{X_t(\cdot) : t \in T\} = C$, then $J$ is also a 
sample modulus for the process $X_t$, with the intrinsic pseudo-metric $d_X(s, t) := 
(E(X_s - X_t)^2)^{1/2}$ on $T$. 

Recall from Section 1.2 the definition of $D(\varepsilon, C) := D(\varepsilon, C, d)$, the maximum 
number of points of $C$ more than $\varepsilon$ apart for $d$. Similarly let $N(\varepsilon, C) := 
N(\varepsilon, C, d)$. In Hilbert space $d$ will be the usual metric. In the following, when 
it is said that a stochastic process $X_t$ "can be chosen" with some properties, 
it means there is a process $V_t$ on the same parameter space $T$ and probability 
space $\Omega$ such that for each $t \in T$, $\Pr(V_t = X_t) = 1$ ($V$ is a modification of $X$) 
and such that $V$ has the given properties. Now the main theorem of this section can 
be stated. Forms of it with expectations are given in Theorem 2.37(a) and 2.38. 


Theorem 2.36 For any $C \subset H$, if $\int_0^1 (\log N(t, C))^{1/2}\,dt < \infty$, then $C$ is a 
GC-set, and if 

$$f(x) := f_C(x) := \int_0^x (\log N(t, C))^{1/2}\,dt, \quad x > 0, \tag{2.23}$$

then $f$ is a sample modulus for $L$ on $C$. 


Note. If $C$ is bounded, then $N(t, C) = 1$ and $\log N(t, C) = 0$ for $t$ large enough, 
and $N(\cdot, C)$ is a nonincreasing function, so integrability of $(\log N(t, C))^{1/2}$ is 
only an issue near $t = 0$. If $f(x) = +\infty$ for some $x > 0$, then $f(x) = +\infty$ 
for all $x > 0$, so it still provides a sample modulus but only a trivial one. By 
Theorem 1.9, $N(t, C)$ could be replaced equivalently by $D(t, C)$. 


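Proof. We have $f(1) < \infty$, and we can assume $C$ is infinite. Then $H(\varepsilon) := 
H(\varepsilon, C) \to +\infty$ as $\varepsilon \downarrow 0$. Sequences $\delta_n \downarrow 0$ and $\varepsilon(n) := \varepsilon_n \downarrow 0$ will be defined 
recursively as follows. Let $\varepsilon_1 := 1$. Given $\varepsilon_1, \ldots, \varepsilon_n$, let 

$$\delta_n := 2\inf\{\varepsilon : H(\varepsilon) \le 2H(\varepsilon_n)\},$$
$$\varepsilon_{n+1} := \min(\varepsilon_n/3, \delta_n).$$

Then $\varepsilon_n \le 3(\varepsilon_n - \varepsilon_{n+1})/2$. Also, if $\varepsilon_{n+1} = \delta_n$, then 

$$\int_{\varepsilon(n+1)}^{\varepsilon(n)} H(x)^{1/2}\,dx \le 2\varepsilon_n H(\varepsilon_n)^{1/2},$$

while otherwise $\varepsilon_{n+1} = \varepsilon_n/3$ and 

$$\int_{\varepsilon(n+1)}^{\varepsilon(n)} H(x)^{1/2}\,dx \le 2H(\varepsilon_{n+1})^{1/2}\varepsilon_{n+1}.$$


It follows that for any $n = 1, 2, \ldots$, 

$$\frac{2}{3}\sum_{m=n}^{\infty} H(\varepsilon_m)^{1/2}\varepsilon_m \le \sum_{m=n}^{\infty} (\varepsilon_m - \varepsilon_{m+1})H(\varepsilon_m)^{1/2} \le f(\varepsilon_n) \le 4\sum_{m=n}^{\infty} \varepsilon_m H(\varepsilon_m)^{1/2}.$$

So the convergence of the above sums is equivalent to that of the integral 
defining $f(1)$, and they all converge. 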

For each $n$, there is a set $A_n \subset C$ such that for any $x \in C$ we have $\|x - 
y\| \le 2\delta_n$ for some $y \in A_n$, and the number of elements of $A_n$ is bounded 
by $\mathrm{card}(A_n) \le \exp(2H(\varepsilon_n))$. Let $G_n := \{x - y : x, y \in A_{n-1} \cup A_n\}$. Then 
$\mathrm{card}(G_n) \le 4\exp(4H(\varepsilon_n))$. Let 

$$P_n := \Pr\{\max\{|L(z)|/\|z\| : 0 \ne z \in G_n\} \ge 3H(\varepsilon_n)^{1/2}\}.$$

If $\Phi$ is the standard normal distribution function, then for $T > 0$, $1 - \Phi(T) \le 
\frac{1}{2}\exp(-T^2/2)$ by Proposition 2.5(a). Then 

$$P_n \le 4\exp\{4H(\varepsilon_n) - 9H(\varepsilon_n)/2\} = 4\exp(-H(\varepsilon_n)/2).$$

Since $H(\varepsilon_{n+2}) \ge H(\delta_n/3) \ge 2H(\varepsilon_n)$, $\sum_n P_n$ is dominated for $n \ge 2$ by a sum 
of two geometric series, one for $n$ even and one for $n$ odd, and so converges. 


Then for almost all $\omega$ there is an $n_0(\omega)$ such that for all $n \ge n_0(\omega)$, $|L(z)| \le 
3\|z\| H(\varepsilon_n)^{1/2}$ for all $z \in G_n$. 
Either $\delta_m = \varepsilon_{m+1}$ or $\delta_m \le 2\varepsilon_m = 6\varepsilon_{m+1}$, so $\delta_m \le 6\varepsilon_{m+1}$ for all $m \ge 1$, 
and $\sum_n \delta_{n-1}H(\varepsilon_n)^{1/2} < \infty$. For each $x \in C$ choose $A_n(x) \in A_n$ with 
$\|x - A_n(x)\| \le 2\delta_n$. Then $\|A_{n-1}(x) - A_n(x)\| \le 2\delta_{n-1} + 2\delta_n$ and for almost 
all $\omega$, $L(A_n(x))(\omega)$ is a Cauchy sequence for all $x \in C$. Define a process $M$ by 
$M(x)(\omega) := \lim_{n\to\infty} L(A_n(x))(\omega)$, $x \in C$, or $M(x)(\omega) := 0$ if the sequence 
does not converge. Then since $L(\cdot)$ is continuous in probability, for each 
$x \in C$, $M(x) = L(x)$ a.s., and $M$ and $L$ have the same laws (on $C$). From the 
definition of sample modulus we can then assume that $L = M$. So except for $\omega$ 
in the set of 0 probability where $n_0(\omega)$ is undefined, $L(A_n(x))(\omega) \to L(x)(\omega)$ 
as $n \to \infty$ for all $x \in C$. 

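If $s, t \in C$ and $\varepsilon_{n+1} < \|s - t\| \le \varepsilon_n$, then $\|A_n(s) - A_n(t)\| \le \|s - t\| + 
4\delta_n$, and if $n \ge n_0(\omega)$, then 

$$|L(s) - L(t)|(\omega) \le |L(A_n(s)) - L(A_n(t))|(\omega) + \sum_{m=n}^{\infty} \big[\,|L(A_m(s)) - L(A_{m+1}(s))|(\omega) + |L(A_m(t)) - L(A_{m+1}(t))|(\omega)\,\big]$$
$$\le (\|s - t\| + 4\delta_n)\,3H(\varepsilon_n)^{1/2} + \sum_{m=n}^{\infty} 8\delta_m\,3H(\varepsilon_{m+1})^{1/2}.$$

Thus 

$$|L(s) - L(t)|(\omega) \le 75\|s - t\|\,H(\|s - t\|)^{1/2} + 144\sum_{m>n} \varepsilon_m H(\varepsilon_m)^{1/2} \le 291 f(\|s - t\|).$$

This is valid for $\|s - t\| \le \varepsilon_{n_0(\omega)}$. The hypothesis implies that $C$ is totally 
bounded, so whenever $n_0(\omega) < \infty$, $|L(\cdot)(\omega)|$ is bounded on $C$, say, by 
$K(\omega)$. So the conclusion of Theorem 2.36 holds with $M(\omega)$ defined as 
$\max(291, 2K(\omega)/f(\varepsilon_{n_0(\omega)}))$. 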


It follows from Theorem 2.36 that $C$ is a GC-set if as $\varepsilon \downarrow 0$, $N(\varepsilon, C) = 
O(\exp(\varepsilon^{-p}))$ for some $p < 2$, or if $N(\varepsilon, C) = O(\exp(\varepsilon^{-2}|\log \varepsilon|^{-r}))$ for some 
$r > 2$. On the other hand, Theorem 2.14 implies that $C$ is not a GB-set 
if as $\varepsilon \downarrow 0$, eventually $N(\varepsilon, C) \ge \exp(\varepsilon^{-p})$ for some $p > 2$ or $N(\varepsilon, C) \ge 
\exp(\varepsilon^{-2}|\log \varepsilon|^{s})$ for some $s > 0$. It turns out that the gap cannot be closed 
further: if $N(\varepsilon, C)$ is of the order of $\exp(\varepsilon^{-2}|\log \varepsilon|^{-r})$ for $0 \le r \le 2$, there 
are examples showing that $C$ may or may not be a GB-set (see Problems 14 
and 15). So a characterization of the GB-property cannot be given in terms of 
metric entropy, although it comes rather close. For a characterization in other 
terms, see Section 2.10. 
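
The dichotomy at $p = 2$ can be seen by direct arithmetic. The following is a 
minimal sketch, assuming the model case $\log N(t, C) = t^{-p}$ exactly (a 
hypothetical entropy function chosen only for illustration), so that the 
integrand in (2.23) is $t^{-p/2}$. The function name `entropy_integral` is 
illustrative and not from the text. 

```python
import math

def entropy_integral(p: float, a: float) -> float:
    """Integral over [a, 1] of t**(-p/2) dt, i.e. of (log N(t))**(1/2)
    under the illustrative assumption log N(t) = t**(-p)."""
    half = p / 2.0
    if math.isclose(half, 1.0):
        return -math.log(a)  # integrand is 1/t
    return (1.0 - a ** (1.0 - half)) / (1.0 - half)

if __name__ == "__main__":
    # As the lower limit a decreases to 0, the value stays bounded only for p < 2.
    for p in (1.0, 2.0, 3.0):
        values = [entropy_integral(p, 10.0 ** -k) for k in (2, 4, 6)]
        print(p, [round(v, 3) for v in values])
```

For $p = 1$ the truncated integrals stay below 2, while for $p = 2$ and $p = 3$ 
they grow without bound as the lower limit decreases, matching the role of the 
condition $p < 2$ above. 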
Remark. If $C$ is a GC-set, then $L(\cdot)(\cdot)$ can be chosen such that for all $\omega$, 
$x \mapsto L(x)(\omega)$ is continuous for $x \in C$. Then for any countable dense subset $A$ 
of $C$, $L(C)^* = \sup_{x \in A} L(x)$ a.s. 


Next, the same integral as in Theorem 2.36 yields a bound for expectations 
of certain suprema. 


Theorem 2.37 Let $C \subset H$ be nonempty and let $D := \mathrm{diam}\, C = 
\sup_{x, y \in C} \|x - y\|$. Let $B := B(C) := \{x - y : x, y \in C\}$. Then for $f_C$ as 
in Theorem 2.36, 


(a) $E|L(B)|^* \le 81 f_C(D/4)$ and 
(b) $EL(C)^* \le 81 f_C(D/4)$. 


Remarks. All three quantities in (a), (b) are invariant under translation, replacing 
$C$ by $\{c + u : c \in C\}$ for any fixed $u$. But $E|L(C)|^*$ does not have such 
invariance, and becomes unbounded as $\|u\| \to \infty$, so for it we cannot have an 
upper bound $K f(D)$, $K < \infty$. 

If the constant 81 is replaced by a larger one, one can have, instead of the 
quantities on the left in (a) and (b), Young-Orlicz norms (Appendix H) $\|\cdot\|_g$ 
where $g(x) := \exp(x^2) - 1$; see Theorem 2.20. 


Proof. Note that $\log N(t, C) = 0$ for $t \ge D/2$, so $f(D/2) = f(+\infty)$. If 
$f(x) = +\infty$ for some (and hence all) $x > 0$, then (a) and (b) hold trivially 
(under the given definitions). If $f(D) < \infty$, then we can take $L$ sample-continuous 
on $C$ by Theorem 2.36. 

Before proving Theorem 2.37, here is a consequence: 


Theorem 2.38 Let $C \subset H$ with, for $f_C$ as defined in Theorem 2.36, $f_C(x) < 
+\infty$ for some, or equivalently all, $x > 0$. Then for any $\delta > 0$, 

$$E\left(\sup\{|L(x) - L(y)| : x, y \in C,\ \|x - y\| \le \delta\}\right) \le K f_C(\delta/4) \tag{2.24}$$

for $K = 162\sqrt{2}$. Thus if $\{X_t\}_{t \in T}$ is a Gaussian process with mean 0, it can be 
chosen so that for any $\delta > 0$, 

$$E\left(\sup\{|X_s - X_t| : s, t \in T,\ d_X(s, t) \le \delta\}\right) \le K \int_0^{\delta/4} \sqrt{\log N(u, T, d_X)}\,du. \tag{2.25}$$


Proof. As in Theorem 2.37 let $B := B(C) := \{x - y : x, y \in C\}$, and let 
$B_\delta := \{z \in B : \|z\| \le \delta\}$. For any 
$\varepsilon > 0$ we have $N(\varepsilon, B_\delta) \le N(\varepsilon, B) \le N(\varepsilon/2, C)^2$. We also have $\mathrm{diam}(B_\delta) \le 
2\delta$. Let $D_\delta := B(B_\delta) := \{x - y : x, y \in B_\delta\}$. Then $N(\varepsilon, D_\delta) \le N(\varepsilon/4, C)^4$, 
so by Theorem 2.36, $D_\delta$ is a GC-set, and by Theorem 2.32, we can take $L$ to 
be prelinear on it. Applying Theorem 2.37(a) we get 

$$E \sup\{|L(x) - L(y)| : x, y \in C,\ \|x - y\| \le \delta\} = E|L(B_\delta)|^* \le E|L(D_\delta)|^*$$

(because $0 \in B_\delta$, so $B_\delta \subset D_\delta$) 

$$\le 81 f_{B_\delta}(2\delta/4) \le 81 f_B(\delta/2) \le 81 \int_0^{\delta/2} \sqrt{2\log N(\varepsilon/2, C)}\,d\varepsilon = 162\sqrt{2} \int_0^{\delta/4} \sqrt{\log N(u, C)}\,du,$$

proving (2.24). For (2.25), if the right side is $+\infty$ it holds trivially. If it is finite, 
apply the equality in distribution of $\{X_t\}_{t \in T}$ and $\{L(X_t(\cdot))\}_{t \in T}$ for $X_t(\cdot) \in H = 
L^2(P)$, with $d_X(s, t) = \|X_s(\cdot) - X_t(\cdot)\|$ as usual. 


To prove Theorem 2.37 here is, first: 


Lemma 2.39 Let $g_0(u) := 2u\phi(\Phi^{-1}(1/(2u)))$ for $u > 1/2$, where $\phi$ and $\Phi$ 
are the standard normal density and distribution function, respectively, and 
$\Phi^{-1}(y) := x$ such that $\Phi(x) = y$, $0 < y < 1$. Then $g_0$ is concave. For any 
random variable $Z$ with distribution $N(0, \sigma^2)$ and any event $A$ with $P(A) > 0$, 

$$\int_A |Z|\,dP \le \sigma P(A) g_0(1/P(A)). \tag{2.26}$$

Proof. Let $h(v) := g_0(v/2) = v\phi(\Phi^{-1}(1/v))$ for $v > 1$. Then 

$$h'(v) = \phi(\Phi^{-1}(1/v)) + v\,\phi'(\Phi^{-1}(1/v))\,(\Phi^{-1})'(1/v)\left(-\frac{1}{v^2}\right) = \phi(\Phi^{-1}(1/v)) + \frac{1}{v}\Phi^{-1}(1/v),$$

$$h''(v) = \Phi^{-1}(1/v)\left(\frac{1}{v^2}\right) - \frac{1}{v^2}\Phi^{-1}(1/v) + \frac{1}{v}\left(-\frac{1}{v^2}\right)\frac{1}{\phi(\Phi^{-1}(1/v))} < 0,$$

so $h$ is concave for $v > 1$ and $g_0$ is concave for $u > 1/2$. 
Next, the left side of (2.26) is maximized for fixed $P(A) > 0$ when $A$ is a 
set $\{|Z| > r\}$ for some $r \ge 0$, by the Neyman-Pearson lemma (e.g. Lehmann 
1986, p. 74). Then $P(A) = 2\Phi(-r/\sigma)$ and 

$$\int_A |Z|\,dP = (2/\pi)^{1/2}\sigma \cdot \exp(-r^2/(2\sigma^2)).$$

Letting $x = r/\sigma$ we need to prove, for $x \ge 0$, 

$$2\phi(x) \le 2\Phi(-x) g_0(1/(2\Phi(-x))). \tag{2.27}$$

Setting $u := 1/(2\Phi(-x))$, so that $x = -\Phi^{-1}(1/(2u))$, shows that (2.27) holds, 
with equality. 

Lemma 2.40 If $Z_1, \ldots, Z_N$ are each normally distributed with mean 0 and 
variance $\le \sigma^2$, then $E\max_{1 \le j \le N} |Z_j| \le \sigma g_0(N)$. 
Proof. The probability space $\Omega$ is a union of disjoint events $A_j$ such that 
$|Z_j| = \max_{1 \le i \le N} |Z_i|$ on $A_j$. Thus 

$$E\max_{1 \le j \le N} |Z_j| = \sum_{j=1}^{N} \int_{A_j} |Z_j|\,dP \le \sum_{j=1}^{N} \sigma P(A_j) g_0(1/P(A_j)) \le \sigma g_0(N)$$

by Lemma 2.39; the last step uses the concavity of $g_0$, also from Lemma 2.39 
(Jensen's inequality with weights $P(A_j)$). 


Lemma 2.41 Let $g_1(u) := K(\log(1 + u))^{1/2}$ for $u > 0$ where 

$$K := (2 + [4 + \log 4]/(\log(3/2)))^{1/2}.$$

Then $g_1(u) \ge g_0(u)$ for all $u \ge 1$. 

Proof. "Mills' ratio" satisfies, for $x \ge 0$, Komatsu's inequality 

$$M(x) := \Phi(-x)/\phi(x) \ge 2/(x + (x^2 + 4)^{1/2})$$

(RAP, Section 12.1, Problem 7). Thus $M(x) \ge 1/(x^2 + 4)^{1/2}$. Let $x := 
-\Phi^{-1}(1/(2u)) \ge 0$, so $u = 1/(2\Phi(-x))$. Then we need to show that 

$$g_1(1/(2\Phi(-x))) \ge \phi(x)/\Phi(-x) = 1/M(x),$$

so it will be enough to prove $g_1(1/(2\Phi(-x))) \ge (x^2 + 4)^{1/2}$. Since $\Phi(-x) \le 
\exp(-x^2/2)$ for $x \ge 0$ (RAP, Lemma 12.1.6(b)), and $g_1$ is nondecreasing, it 
will be enough to show 

$$g_1(\exp(x^2/2)/2) \ge (x^2 + 4)^{1/2}, \quad x \ge 0.$$

Letting $y := \exp(x^2/2)/2$, we need to show 

$$g_1(y) = K(\log(1 + y))^{1/2} \ge (4 + 2\log(2y))^{1/2}, \quad y \ge 1/2,$$

or 

$$K^2 \log(1 + y) \ge 4 + 2\log 2 + 2\log y, \quad y \ge 1/2,$$

which follows from the definition of $K$. 
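
Lemmas 2.40 and 2.41 lend themselves to a quick numerical sanity check. The 
sketch below evaluates $g_0$ and $g_1$ with Python's standard-library 
`statistics.NormalDist`, and estimates $E\max_j |Z_j|$ by simulation for i.i.d. 
standard normal $Z_j$ (so $\sigma = 1$; independence is an assumption made here 
for convenience, the lemma does not need it). The function names, sample sizes 
and seed are illustrative choices, not part of the text. 

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal law N(0, 1)
# Constant K of Lemma 2.41.
K = math.sqrt(2 + (4 + math.log(4)) / math.log(1.5))

def g0(u: float) -> float:
    """g_0(u) = 2u * phi(Phi^{-1}(1/(2u))); evaluated here only for u >= 1."""
    return 2 * u * STD.pdf(STD.inv_cdf(1 / (2 * u)))

def g1(u: float) -> float:
    """g_1(u) = K * (log(1 + u))**(1/2)."""
    return K * math.sqrt(math.log(1 + u))

def mean_max_abs(n: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E max_j |Z_j| for n i.i.d. N(0, 1) variables."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(abs(rng.gauss(0.0, 1.0)) for _ in range(n))
    return total / trials

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(mean_max_abs(n, 5000), 3), round(g0(n), 3), round(g1(n), 3))
```

For $N = 1$ the bound of Lemma 2.40 is attained, since $g_0(1) = 2\phi(0) = 
(2/\pi)^{1/2} = E|Z|$ for $Z$ standard normal. 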


Now to prove Theorem 2.37, $D = 0$ if and only if $C$ consists of a single 
point. Then both sides of (a) and (b) are 0 and they hold. So assume $D > 0$. 
Let $\varepsilon_k := D/2^k$ and $N_k := N(\varepsilon_k/2, C)$, $k = 0, 1, \ldots$. Then for each $k = 
0, 1, 2, \ldots$, there is a set $C_k$ of $N_k$ points $x_{kj}$, $j = 1, \ldots, N_k$, such that for all 
$x \in C$, $\|x - x_{kj}\| \le \varepsilon_k$ for some $j$. Then $N_0 = 1$, so $C_0 = \{x_{01}\}$ for some $x_{01}$. 
For each $k = 1, 2, \ldots$, and $j = 1, \ldots, N_k$, choose and fix a point $y_{kj} = x_{k-1,i}$ 
for some $i$ such that $\|x_{kj} - y_{kj}\| \le \varepsilon_{k-1}$. Let $W_k$ be the set of all variables 
$L(x_{kj}) - L(y_{kj})$, $j = 1, \ldots, N_k$. Then by Lemma 2.40, 

$$E\max\{|z| : z \in W_k\} \le \varepsilon_{k-1} g_0(N_k).$$
For any $u_k \in C_k$ there is a sequence of points $u_j \in C_j$, $j = 0, \ldots, k$, such 
that $L(u_j) - L(u_{j-1}) \in W_j$, $j = 1, \ldots, k$. Thus 

$$E\sup\Big\{|L(x) - L(y)| : x, y \in \bigcup_{i=1}^{k} C_i\Big\} \le 2\sum_{j=1}^{k} \varepsilon_{j-1} g_0(N_j).$$

The union of all the $C_j$ is dense in $C$, so by sample continuity and monotone 
convergence, 

$$E_C := E\sup\{|L(x) - L(y)| : x, y \in C\} \le 2\sum_{j=1}^{\infty} \varepsilon_{j-1} g_0(N_j) = 4D\sum_{j=1}^{\infty} g_0(N_j)/2^j.$$

By Lemma 2.41, where $K < 4$, we get 

$$E_C \le 16D\sum_{j=1}^{\infty} (\log(1 + N_j))^{1/2}/2^j.$$

For all $j \ge 1$, $N_j \ge 2$, so $[\log(1 + N_j)/\log N_j]^{1/2} \le (\log 3/\log 2)^{1/2} < 1.26$. 
Thus 

$$E_C \le 20.2\,D\sum_{j=1}^{\infty} (\log N_j)^{1/2}/2^j \le 81\sum_{j=1}^{\infty} \int_{\varepsilon_{j+2}}^{\varepsilon_{j+1}} (\log N(t, C))^{1/2}\,dt = 81 f(D/4),$$

proving (a). Then for any fixed $y \in C$, 

$$\sup_{x \in C} L(x) \le L(y) + \sup_{x \in C} |L(x) - L(y)|,$$

so, since $EL(y) = 0$, (b) follows and Theorem 2.37 is proved. 


2.9 Gaussian Concentration Inequalities 


This section is based on excerpts from the book of Ledoux (2001). Let $(S, d)$ 
be a metric space and $\mu$ a probability measure on the Borel sets of $S$. For a 
set $A \subset S$ and $r > 0$, let $A^r := \{y \in S : d(x, y) < r \text{ for some } x \in A\}$. The 
concentration function for $\mu$ (and the given metric $d$) is defined by 

$$\alpha_\mu(r) := \alpha_{\mu,d}(r) := \sup\{1 - \mu(A^r) : A \subset S,\ \mu(A) \ge 1/2\}. \tag{2.28}$$

Here $A$ ranges over Borel sets. Each set $A^r$ is open, for any set $A$. In a simple 
example let $S = \mathbb{R}$ with usual metric and $\mu = N(0, 1)$. For $A = (-\infty, 0]$ we 
have $1 - \mu(A^r) = 1 - \Phi(r) \le \exp(-r^2/2)$ where $\Phi$ is the standard normal 
distribution function. The choice of $A$ turns out to be an extremal one among 
all sets with $\mu(A) \ge 1/2$, so that in fact $\alpha_\mu(r) \le \exp(-r^2/2)$ (Ledoux 2001, 
(1.4)) although that will not be proved here. Inequalities of the same form, 
possibly with right-hand side of the form $b\exp(-cr^2)$ for some $b > 0$ and 
$c > 0$, hold in multidimensional and even infinite-dimensional spaces. Such an 
inequality, although not with best constants, will be proved. 

There is a wide choice of measurable sets $A$ with $\mu(A) \ge 1/2$. Among the 
most useful are the following. Let $F$ be a real-valued measurable function 
on $S$, which thus is a random variable on the probability space $(S, \mu)$. Let 
$m(F)$ be a median of $F$ (which is not necessarily unique). Then the sets 
$A_1 := \{x : F(x) \le m(F)\}$ and $A_2 := \{x : F(x) \ge m(F)\}$ satisfy $\mu(A_i) \ge 1/2$, 
$i = 1, 2$. First, let $A = A_1$. In order that sets $A^r$ should be meaningful in relation 
to $F$ it is useful that $F$ be a Lipschitz function, meaning that the Lipschitz 
seminorm $\|F\|_L := \sup_{x \ne y} |F(x) - F(y)|/d(x, y)$ is finite. Let $K := \|F\|_L$. 
Then for any $r > 0$ we have 

$$\Pr(F - m(F) > Kr) \le \alpha_\mu(r) \tag{2.29}$$

since the given event is included in the complement of $A_1^r$. Using also $A = A_2$ 
we would get 

$$\Pr(|F - m(F)| > Kr) \le 2\alpha_\mu(r). \tag{2.30}$$

Thus if we have good Gaussian-type bounds on $\alpha_\mu(r)$, the values of Lipschitz 
functions with bounded Lipschitz seminorms will be concentrated close to their 
medians (and, it will be seen in Proposition 2.46, also close to their means). 

One class of examples of Lipschitz functions is bounded linear functionals 
on normed linear spaces, whose Lipschitz seminorms equal their dual norms. 
One can take the maximum, or supremum, of two or more Lipschitz functions 
with bounded $\|\cdot\|_L$ and get a Lipschitz function, as follows: 


Proposition 2.42 Let $(S, d)$ be a metric space. Let $\mathcal{F}$ be a collection of real-valued 
Lipschitz functions on $S$ for $d$ such that $M := \sup_{f \in \mathcal{F}} \|f\|_L < +\infty$. 
Let $F(x) := \sup_{f \in \mathcal{F}} f(x)$ and suppose that $F(x) < +\infty$ for all $x$. Then $F$ is a 
Lipschitz function with $\|F\|_L \le M$. 


Proof. Suppose that $|F(x) - F(y)| > M d(x, y)$ for some $x, y \in S$. By symmetry 
we can assume that $F(x) > F(y) + M d(x, y)$. Then for some $f \in \mathcal{F}$, 
$f(x) > F(y) + M d(x, y) \ge f(y) + M d(x, y)$, but this contradicts $\|f\|_L \le 
M$, finishing the proof. 


The hypotheses of the preceding proposition clearly hold whenever $\mathcal{F}$ is 
a finite set of Lipschitz functions. By taking suprema, a rather large class of 
Lipschitz functions can be generated starting with some basic ones. 

Let $\{X_t\}_{t \in T}$ be any Gaussian process with mean 0 and $G$ a finite subset 
of $T$. Let $F := \max_{t \in G} X_t$. By way of Theorem 2.2 and Proposition 2.3, we 
can view each $X_t$ as a linear function $f_t$ on a finite-dimensional Hilbert space 
having the probability measure $\gamma = N(0, I)$. Let $\sigma_t := (E(X_t^2))^{1/2}$. Then $f_t$ 
is Lipschitz with $\|f_t\|_L = \sigma_t$ and by Proposition 2.42, $F$ is represented as a 
Lipschitz function with 

$$\|F\|_L \le \sigma_G := \max_{t \in G} \sigma_t. \tag{2.31}$$

Here is a concentration inequality (one part of Ledoux 2001, Theorem 2.15). 
Ledoux (2001, Corollary 2.6) gives a Gaussian concentration inequality with 
$\exp(-r^2/2)$ in place of $2\exp(-r^2/4)$ and $F - EF$ in place of $F - m(F)$, but 
with a harder proof. 


Theorem 2.43 Let $d = 1, 2, \ldots$ and let $\gamma$ be the standard normal law $N(0, I)$ 
on $\mathbb{R}^d$. Then for all $r > 0$, 

$$\alpha_\gamma(r) \le 2\exp(-r^2/4).$$


An inequality to be used in the proof is one that Ledoux (2001, p. 33) calls 
a functional version of a multiplicative Brunn—Minkowski inequality. 


Theorem 2.44 Let $f$, $g$, and $h$ be nonnegative, measurable functions on $\mathbb{R}^d$ 
with $g$ and $h$ integrable (for Lebesgue measure) and let $0 < \theta < 1$. Suppose 
that for all $x$ and $y \in \mathbb{R}^d$, 

$$f(\theta x + (1 - \theta)y) \ge g(x)^{\theta} h(y)^{1-\theta}. \tag{2.32}$$

Then 

$$\int f(x)\,dx \ge \left(\int g(x)\,dx\right)^{\theta} \left(\int h(y)\,dy\right)^{1-\theta}. \tag{2.33}$$

Remarks. If $\theta = 0$ or 1, the conclusion holds trivially. If $\int g\,dx$ or $\int h\,dx$ is 
$+\infty$ and the other is positive, the conclusion still follows via approximation by 
increasing sequences of integrable functions. If one of the two integrals is $+\infty$ 
and the other is 0, the best available lower bound for $\int f\,dx$ is 0, as is seen by 
setting $f = h = 0$, $g = 1$. 


Proof. The proof will be by induction on $d$. First, let $d = 1$. Ledoux (2001, p. 
34) says in half a sentence “we may assume ... by approximation that” $g$ and 
$h$ “are continuous with strictly positive values.” In the following proof, more 
than a page is spent on proving one can take them continuous and then another 
half-page on taking them strictly positive. After that the remaining page of the 
proof proceeds almost exactly as in Ledoux’s book. 

For any $M$ with $0 < M < +\infty$, if we replace $g$ by $g_M := \min(g, M)$ and $h$ 
by $h_M := \min(h, M)$, the assumption (2.32) still holds, and if the conclusion 
does, we can let $M \uparrow +\infty$ and use monotone convergence to get the conclusion 
in general, so we can assume that for some $M$ with $1 \le M < +\infty$, $0 \le g(x) \le 
M$ and $0 \le h(x) \le M$ for all $x$. Similarly, we can assume that for some $L < 
+\infty$, $g(x) = h(x) = 0$ for $|x| > L$. 
Let $\lambda$ be Lebesgue measure on $\mathbb{R}$. Given any $\varepsilon > 0$, by dominated convergence 
there is a $\delta > 0$ with $\delta < \varepsilon/M$ such that if $\lambda(B) < \delta$, then $\int_B (g + h)\,dx < 
\varepsilon$. To approximate $g$ and $h$ by continuous functions, first, by Lusin’s theorem 
(RAP, Theorem 7.5.2) there is a compact set $K \subset [-L, L]$ such that $(g, h)$ from 
$K$ to $\mathbb{R}^2$, thus $g$ and $h$ from $K$ into $\mathbb{R}$, are continuous, and $\lambda([-L, L] \setminus K) < \delta$. 
We have $0 \le g1_K \le g$ and $0 \le h1_K \le h$. Thus (2.32) holds for $g1_K$ and $h1_K$ 
in place of $g$ and $h$, with $\int (h - h1_K)\,dx < \varepsilon$ and $\int (g - g1_K)\,dx < \varepsilon$. 

There exist continuous functions $g_n$ with $g_n = g$ on $K$, $0 \le g_n \le M$, 
$g_n = 0$ outside $[-2L, 2L]$, and $g_n \downarrow g1_K$ as $n \to +\infty$. Likewise there exist 
$h_n$ for $h$. Recall that the support of a function $\phi$ on a topological space is the 
closure of the set on which $\phi \ne 0$. 

I claim the following: if $\phi$ is a continuous function from $\mathbb{R}^2$ into $\mathbb{R}$ having 
compact support $J$, then the function $\phi_1(x) := \sup_y \phi(x, y)$ is continuous on $\mathbb{R}$ 
and has compact support. To see this, we have $J \subset A \times B$ for some compact 
subsets $A$ and $B$ of $\mathbb{R}$. Let $x_n \to x \in A$. If $\phi_1(x_n) \not\to \phi_1(x)$, taking a subsequence 
we can assume that $\phi_1(x_n) = \phi(x_n, y_n)$ where $y_n \to y$ for some $y \in B$, 
so $\phi(x_n, y_n) \to \phi(x, y) \le \phi_1(x)$. If $\phi(x, y) < \phi_1(x)$, let $\phi_1(x) = \phi(x, u)$ for 
some $u$. Then for $n$ large enough, $\phi(x_n, u) > \phi(x_n, y_n)$, contradicting the choice 
of $y_n$ and proving the claim. 

If $g$ and $h$ are continuous on $\mathbb{R}$ with compact supports $A$ and $B$ respectively, 
then the function $F(z, y) := g(z - y)h(y)$ is continuous on $\mathbb{R}^2$, and it has 
compact support since it is 0 if $y \notin B$ or if $z - y \notin A$, and so if $z \notin A + B := 
\{a + b : a \in A, b \in B\}$, thus $F$ is 0 outside the compact set $(A + B) \times B$. 

For each $n$, the function $G_n(z, y) := g_n([z - (1 - \theta)y]/\theta)^{\theta} h_n(y)^{1-\theta}$ is similarly 
continuous with compact support, thus by the claim, so is $f_n(z) := 
\sup_y G_n(z, y)$. It follows that $f_n(\theta x + (1 - \theta)y) \ge g_n(x)^{\theta} h_n(y)^{1-\theta}$ for all $x$ 
and $y$. Since $g_n$ and $h_n$ are nonincreasing sequences of functions with respect 
to $n$, so is $G_n$ and thus $f_n$. We have 

$$G_n(z, y) \downarrow G_0(z, y) := (g1_K)([z - (1 - \theta)y]/\theta)^{\theta}\,(h1_K)(y)^{1-\theta}$$

for all $z$ and $y$. I claim that $f_n(z) \downarrow f_0(z) := \sup_y G_0(z, y)$ for all $z$. Clearly 
$f_n(z) \ge f_0(z)$ for all $n$ and $z$. Suppose that for some $z$, $f_n(z) \downarrow c > f_0(z)$. Then 
for some $y_n$, $G_n(z, y_n) \ge c > 0$ for all $n$. Taking a subsequence, we can assume 
that $y_n \to y$ for some $y$. Let $x := [z - (1 - \theta)y]/\theta$, so that $\theta x + (1 - \theta)y = z$. 
If $y \notin K$, then by Dini’s theorem (RAP, Theorem 2.4.10), $h_n \to 0$ uniformly in 
some neighborhood of $y$, so $h_n(y_n) \to 0$ and since $0 \le g_n \le M$, $G_n(z, y_n) \to 
0$, a contradiction. So $y \in K$. Likewise, $x \in K$. Thus for all $n = 0, 1, \ldots$, 
$G_n(z, y) = g(x)^{\theta} h(y)^{1-\theta}$, and $f_0(z) \ge g(x)^{\theta} h(y)^{1-\theta}$. 
Let $x_n := [z - (1 - \theta)y_n]/\theta \to x$. We have 

$$\limsup_{n \to \infty} h_n(y_n) \le \limsup_{n \to \infty} h_1(y_n) = h_1(y) = h(y),$$

and likewise $\limsup_{n \to \infty} g_n(x_n) \le g(x)$, so 

$$\limsup_{n \to \infty} G_n(z, y_n) \le g(x)^{\theta} h(y)^{1-\theta} \le f_0(z),$$

contradicting the choice of $z$. Thus $f_n \downarrow f_0$ as claimed. 

Since $f_n$ are uniformly bounded by $M^2$ and all have supports included in 
$[-2L, 2L]$, by dominated convergence we have $\int f_n(z)\,dz \downarrow \int f_0(z)\,dz$. By the 
definitions, the hypothesis (2.32) holds for the continuous functions $f_n$, $g_n$, $h_n$ 
in place of $f$, $g$, $h$, noting also that $f_0 \le f$, and if we can prove the conclusion 
for $f_n$, $g_n$, and $h_n$, it will follow for $f$, $g1_K$, and $h1_K$. Then letting $\varepsilon \downarrow 0$ it will 
follow for $f$, $g$, and $h$. So we can assume $f$, $g$, and $h$ are each continuous with 
supports included in $[-2L, 2L]$. 

Let $\psi$ be a continuous real function on $\mathbb{R}$, with $0 < \psi(x) \le 1$ and $\psi(-x) = 
\psi(x)$ for all $x$, $\psi(x) = 1$ for $|x| \le 2L$, $\psi(x)$ decreasing as $|x|$ increases for 
$|x| \ge 2L$, and with $\psi$ going to 0 at $\infty$ fast enough so that $\psi^{\theta}$ and $\psi^{1-\theta}$ are 
both integrable. Suppose for some constant $\zeta$ with $0 < \zeta \le 1$, where eventually 
$\zeta \to 0$, we replace $g$ by $g + \zeta\psi$ and $h$ by $h + \zeta\psi$. It will be shown for all $y$ 
and $z$, setting $x := [z - (1 - \theta)y]/\theta$ so that $z = \theta x + (1 - \theta)y$, we have 

$$(g + \zeta\psi)(x)^{\theta}\,(h + \zeta\psi)(y)^{1-\theta} \le f(z) + (M + 1)\big\{(\zeta\psi(z))^{\theta} + (\zeta\psi(z))^{1-\theta} + \psi(z)\big[\zeta^{\theta}\theta^{-1} + \zeta^{1-\theta}(1 - \theta)^{-1}\big]\big\}. \tag{2.34}$$

To prove this first suppose $|z| > 2L$. Clearly $|z| \le \theta|x| + (1 - \theta)|y| \le 
\max(|x|, |y|)$, so either $|x| > 2L$ or $|y| > 2L$. If, for example, $\max(|x|, |y|) = 
|y| \ge |z|$, then $\psi(y) \le \psi(z)$. For any $u$ with $|u| > 2L$ we have $g(u) = h(u) = 
0$. Thus the left side of (2.34) is bounded above by $(M + 1)[(\zeta\psi(z))^{\theta} + 
(\zeta\psi(z))^{1-\theta}]$, which implies (2.34). 

So, suppose $|z| \le 2L$. Then $\psi(z) = 1$. The derivative with respect to $\zeta$ of 
the left side of (2.34) is 

$$\theta(g + \zeta\psi)(x)^{\theta-1}\psi(x)(h + \zeta\psi)(y)^{1-\theta} + (g + \zeta\psi)(x)^{\theta}(1 - \theta)(h + \zeta\psi)(y)^{-\theta}\psi(y)$$
$$\le \theta(\zeta\psi(x))^{\theta-1}\psi(x)(M + 1)^{1-\theta} + (M + 1)^{\theta}(1 - \theta)(\zeta\psi(y))^{-\theta}\psi(y) \le (M + 1)\big[\zeta^{\theta-1} + \zeta^{-\theta}\big].$$

Integrating this bound from 0 to $\zeta$ gives that the left side of (2.34) 
is bounded above by $f(z) + (M + 1)[\zeta^{\theta}\theta^{-1} + \zeta^{1-\theta}(1 - \theta)^{-1}]$, so (2.34) is 
proved in both cases. 

Thus if we replace $g$, $h$, and $f$ by $g + \zeta\psi$, $h + \zeta\psi$, and the right side of 
(2.34) respectively, hypothesis (2.32) holds, and each function is integrable, 
with integrals approaching those of $g$, $h$, and $f$ respectively as $\zeta \downarrow 0$. So we 
can assume $g$ and $h$ are continuous and everywhere strictly positive. Clearly 
we can assume $\int g\,dx = 1$ and $\int h\,dy = 1$. 
Now, the function $x \mapsto \int_{-\infty}^{x} g(u)\,du$ is $C^1$, with a strictly positive derivative 
$g$, from $\mathbb{R}$ onto $(0, 1)$. So this function has an inverse $t \mapsto x(t)$ from $(0, 1)$ onto 
$\mathbb{R}$, with a continuous derivative $x'(t) = 1/g(x(t))$ for all $t \in (0, 1)$. Likewise, 
$y \mapsto \int_{-\infty}^{y} h(v)\,dv$ is $C^1$ with a positive derivative $h$ and has an inverse $t \mapsto 
y(t)$ from $(0, 1)$ onto $\mathbb{R}$ with a continuous derivative $y'(t) = 1/h(y(t))$. Let 
$z(t) := \theta x(t) + (1 - \theta)y(t)$ for $0 < t < 1$. For any $a > 0$ and $b > 0$ we have 
$\theta a + (1 - \theta)b \ge a^{\theta}b^{1-\theta}$. (Setting $u := a/b > 0$ this is equivalent to $\theta u + 1 - 
\theta \ge u^{\theta}$, which holds with equality for $u = 1$ and follows for all $u$ by taking 
derivatives with respect to $u$.) It follows that for $0 < t < 1$ 

$$z'(t) = \theta x'(t) + (1 - \theta)y'(t) \ge x'(t)^{\theta}\,y'(t)^{1-\theta}. \tag{2.35}$$

Since $z' > 0$ is continuous it follows from the hypothesis (2.32) and (2.35) that 

$$\int f(x)\,dx = \int f(z)\,dz = \int_0^1 f(z(t))\,z'(t)\,dt$$
$$\ge \int_0^1 g(x(t))^{\theta}\,h(y(t))^{1-\theta}\,x'(t)^{\theta}\,y'(t)^{1-\theta}\,dt$$
$$= \int_0^1 [g(x(t))\,x'(t)]^{\theta}\,[h(y(t))\,y'(t)]^{1-\theta}\,dt = 1.$$

Thus the case of dimension $d = 1$ is proved. Now suppose $d > 1$ and that 
the conclusion holds in dimension $d - 1$. Let $f$, $g$, and $h$ on $\mathbb{R}^d$ satisfy the 
hypotheses. For each $x \in \mathbb{R}^{d-1}$ and $u \in \mathbb{R}$ let $f_u(x) := f(x, u)$ and likewise 
define $g_u$ and $h_u$. If $u = \theta u_0 + (1 - \theta)u_1$ for some real $u_0$, $u_1$, then 

$$f_u(\theta x + (1 - \theta)y) \ge g_{u_0}(x)^{\theta}\,h_{u_1}(y)^{1-\theta}$$

for any $x, y \in \mathbb{R}^{d-1}$. It follows by the induction hypothesis, where the integrability 
of $f_u$, $g_{u_0}$, and $h_{u_1}$ holds for Lebesgue almost all $u$, $u_0$, and $u_1$ (which 
will suffice for the conclusion), that 

$$\int_{\mathbb{R}^{d-1}} f_u(x)\,dx \ge \left(\int_{\mathbb{R}^{d-1}} g_{u_0}(x)\,dx\right)^{\theta} \left(\int_{\mathbb{R}^{d-1}} h_{u_1}(y)\,dy\right)^{1-\theta}.$$

This gives the hypothesis (2.32) for the one-dimensional case with functions 
of $u$, where in the hypothesis $u = \theta u_0 + (1 - \theta)u_1$, and of $u_0$ and $u_1$, so by the 
conclusion in that case 

$$\int f(z)\,dz = \int \left(\int f_u(x)\,dx\right) du \ge \left(\int g(x)\,dx\right)^{\theta} \left(\int h(y)\,dy\right)^{1-\theta},$$

which finishes the proof of Theorem 2.44. 
Now to continue the proof of Theorem 2.43, for a bounded measurable 
function $j$ on $\mathbb{R}^d$ and $c > 0$, define the “infimum-convolution” $Q_c j$ by 

$$(Q_c j)(x) := \inf_{v} \left[ j(v) + \frac{c}{2}|x - v|^2 \right]$$

for all $x \in \mathbb{R}^d$. It’s easily seen that $Q_c j$ is a well-defined real-valued function. 
It will be shown that 

$$\int \exp(Q_{1/2}j(x))\,d\gamma(x) \int \exp(-j(y))\,d\gamma(y) \le 1. \tag{2.36}$$

For this, one will apply Theorem 2.44 for $\theta = 1/2$, then square both sides 
of the conclusion. Let $\phi_d$ be the density of $\gamma = N(0, I)$. Let $f := \phi_d$, 
$g(x) := \exp(Q_{1/2}j(x))\phi_d(x)$, and $h(y) := \exp(-j(y))\phi_d(y)$. Hypothesis (2.32) 
then requires that $f((x + y)/2) \ge \sqrt{g(x)h(y)}$ for all $x$ and $y$. To check this, 
note that the normalizing constants of the $\phi_d$’s cancel. Taking logarithms of 
both sides, we get an inequality that holds by definition of $Q_{1/2}j$. The integrability 
of $f$ and $h$ is immediate, and that of $g$ is easy (let $v = x$). Thus Theorem 
2.44 does apply and gives the conclusion (2.36) after squaring. 

Now the bound (2.36) will be extended to any measurable function 
$j$ such that $0 \le j(x) \le +\infty$. Let $j_k$ be nonnegative bounded, measurable 
functions increasing up to $j$. Then (2.36) holds for each $j_k$. By monotone 
convergence, $\int \exp(-j_k)\,d\gamma$ decreases down to $I_0 := \int e^{-j}\,d\gamma \ge 0$ and 
$\int \exp(Q_{1/2}j_k(x))\,d\gamma(x)$ increases up to $I_1 := \int \exp(Q_{1/2}j(x))\,d\gamma(x) \le +\infty$. 
If $I_0 > 0$, then $I_1 < +\infty$ by Fatou’s lemma and (2.36) holds for $j$. If $I_0 = 0$, 
then possibly $I_1 = +\infty$, but if we make the convention that $(+\infty) \cdot 0 \le 1$ in 
this case, then (2.36) still holds. 

Next the following will be proved: 


Proposition 2.45 For any measurable set $A \subset \mathbb{R}^d$ and $\rho > 0$ we have 

$$\gamma\Big(\Big\{x : \inf_{v \in A} |v - x|^2/4 \ge \rho\Big\}\Big) \le e^{-\rho}/\gamma(A).$$


Proof. If $\gamma(A) = 0$, in other words $\lambda(A) = 0$, then one may say that the right 
side is $+\infty$ and the inequality holds trivially. So suppose $\gamma(A) > 0$. 

Apply (2.36) to the function $j(x) = 0$ on $A$ and $j = +\infty$ outside it. 
Then $\int e^{-j}\,d\gamma = \gamma(A) > 0$ so both factors on the left in (2.36) are finite, 
and it gives $\int \exp(Q_{1/2}j(x))\,d\gamma(x) \le 1/\gamma(A)$. It follows from the Markov-Chebyshev 
inequality that $\gamma(\{x : Q_{1/2}j(x) \ge \rho\}) \le e^{-\rho}/\gamma(A)$. For the given 
$j$ we have 

$$Q_{1/2}j(x) = \inf_{v}\{j(v) + |x - v|^2/4\} = \inf_{v \in A} |x - v|^2/4,$$

and the proposition is proved. 
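
In dimension $d = 1$ both sides of Proposition 2.45 can be computed in closed 
form when $A$ is an interval, which gives a simple numerical check. The sketch 
below assumes $A = [a, b]$, so that $\{x : \inf_{v \in A} |v - x|^2/4 \ge \rho\} = 
(-\infty, a - r] \cup [b + r, \infty)$ with $r = 2\sqrt{\rho}$. The helper names are 
illustrative, not from the text. 

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function

def far_set_measure(a: float, b: float, rho: float) -> float:
    """gamma{x : dist(x, [a, b])**2 / 4 >= rho} for gamma = N(0, 1) on the line."""
    r = 2.0 * math.sqrt(rho)
    return PHI(a - r) + (1.0 - PHI(b + r))

def proposition_bound(a: float, b: float, rho: float) -> float:
    """Right side exp(-rho) / gamma([a, b]) of Proposition 2.45."""
    return math.exp(-rho) / (PHI(b) - PHI(a))

if __name__ == "__main__":
    for a, b, rho in [(-1.0, 1.0, 1.0), (3.0, 4.0, 4.0), (2.0, 2.1, 0.5)]:
        print(a, b, rho, far_set_measure(a, b, rho), proposition_bound(a, b, rho))
```

The bound is far from sharp for small intervals, where $1/\gamma(A)$ is large; it 
is used in the text only for sets with $\gamma(A) \ge 1/2$. 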
Now to finish the proof of Theorem 2.43, let $\rho := r^2/4$, so $r = 2\sqrt{\rho}$. Then 
for any $x$, $x \in A^r$ if and only if for some $v \in A$, $|x - v| < r$, or $|x - v|^2/4 < 
r^2/4 = \rho$, so the theorem follows from Proposition 2.45 in light of the definition 
(2.28) of $\alpha_\gamma$. 
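
As an illustration of how the constants compare, the following sketch 
evaluates, for $d = 1$ and the half-line $A = (-\infty, 0]$ from the example at 
the start of this section, the exact value $1 - \gamma(A^r) = 1 - \Phi(r)$ 
alongside $\exp(-r^2/2)$ and the bound $2\exp(-r^2/4)$ of Theorem 2.43. It checks 
a single set $A$ only, not the supremum in (2.28); the function names are 
illustrative. 

```python
import math
from statistics import NormalDist

def half_line_tail(r: float) -> float:
    """1 - gamma(A^r) for A = (-inf, 0] and gamma = N(0, 1), where A^r = (-inf, r)."""
    return 1.0 - NormalDist().cdf(r)

def sharp_bound(r: float) -> float:
    """exp(-r^2 / 2), the bound quoted from Ledoux (2001, (1.4))."""
    return math.exp(-r * r / 2.0)

def theorem_bound(r: float) -> float:
    """2 exp(-r^2 / 4), the bound of Theorem 2.43."""
    return 2.0 * math.exp(-r * r / 4.0)

if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0, 3.0, 4.0):
        print(r, half_line_tail(r), sharp_bound(r), theorem_bound(r))
```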


The bound for $\alpha_\gamma(r)$ given by Theorem 2.43 can be used to bound the 
probabilities of deviations of a Lipschitz function $F$ from its median $m(F)$ 
in (2.29) and (2.30) for $\mu = \gamma$. For any Lipschitz function $F$ on $\mathbb{R}^d$ we have 
$|F(x)| \le |F(0)| + \|F\|_L |x|$ for all $x$ and consequently that $E_\gamma F = \int F\,d\gamma$ 
exists and is finite. Also note here that with respect to $\gamma$, any continuous 
function $F$ (in particular any Lipschitz $F$) has a unique median: if it had a 
nonunique median there would be an interval $[a, b]$ of medians with $a < b$ 
such that $\gamma(F^{-1}((a, b))) = 0$. But $F$ must take values $\ge b$ and $\le a$, thus by 
the intermediate value theorem, it takes all values in $(a, b)$, so $F^{-1}((a, b))$ is a 
nonempty open set and must have probability $> 0$ for $\gamma$. 

To compare the mean and median we have: 


Proposition 2.46 There exists an absolute constant $C$ such that for any $d$, any 
Lipschitz function $F$ on $\mathbb{R}^d$ with $K = \|F\|_L$ and its unique median $m(F)$ for 
$\gamma$, we have $|EF - m(F)| \le CK$. Thus for any $r > C$, 

$$\Pr(F - EF > Kr) \le \alpha_\gamma(r - C) \tag{2.37}$$

and 

$$\Pr(|F - EF| > Kr) \le 2\alpha_\gamma(r - C). \tag{2.38}$$


Proof. For any random variable $U$ let $U^+ := \max(U, 0)$. Letting $U := 
(F - m(F))/\|F\|_L$ we get $EU \le E(U^+)$ bounded by (2.9) for $Y = U^+$, then 
using (2.29) and, for example, Theorem 2.43 to get a bound by an absolute 
constant $C$, not depending on $F$ or the dimension $d$. The other conclusions then 
follow. 


Next, the essential supremum, or essential supremum of absolute values, of 
L on a GB-set satisfies concentration inequalities. Recall L(A)* and |L(A)|* 
from Lemma 2.4. 


Theorem 2.47 Let $H$ be a Hilbert space and $A$ a GB-set in $H$. Let $\sigma := 
\sup_{x \in A} \|x\|$. Let $Y = L(A)^*$ or $|L(A)|^*$. Then 


(a) For any $r > 0$, $\Pr(Y - m(Y) > \sigma r) \le 2\exp(-r^2/4)$. 
(b) For any $r > C$ from Proposition 2.46, 

$$\Pr(Y - EY > \sigma r) \le 2\exp(-(r - C)^2/4)$$

and 

$$\Pr(|Y - EY| > \sigma r) \le 4\exp(-(r - C)^2/4).$$


Proof. A GB-set, being totally bounded by Theorem 2.14, is bounded, so 
$\sigma < +\infty$. By the proof of Lemma 2.4, we can assume that $A$ is countable 
(replacing $A$ by a countable dense subset $B$). Consider finite subsets $B_k \uparrow 
A$. For each $k$ and each $x \in B_k$, we can view $L(x)$ or $|L(x)|$ as a Lipschitz 
function on $\mathbb{R}^d$ where $d$ is the cardinality of $B_k$, with Lipschitz seminorm $\le \sigma$, 
by Proposition 2.3 and (2.31). Then, applying Proposition 2.42, we get that 
$L(B_k)^*$ or $|L(B_k)|^*$ has the same property. Thus (2.29) applies with $F = Y$ and 
$K = \sigma$. Then applying Theorem 2.43 gives (a) for $A = B_k$ for each $k$. Since 
the bound does not depend on $k$, we can let $k \to \infty$ and get the bound as stated 
for $A$. 

For part (b), we can then just apply Proposition 2.46. This finishes the 
proof. 
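
Part (a) can be illustrated by simulation in the simplest case. The sketch 
below assumes $A = \{e_1, \ldots, e_d\}$ is a finite orthonormal set, so that $\sigma = 1$ 
and $Y = L(A)^*$ is the maximum of $d$ i.i.d. $N(0, 1)$ variables, and it uses the 
empirical median of the simulated values as a stand-in for $m(Y)$. The 
function names, the choice $d = 20$, the number of trials and the seed are 
all illustrative. 

```python
import math
import random
import statistics

def simulate_max(d: int, trials: int, seed: int = 0) -> list:
    """Simulated values of Y = max of d i.i.d. N(0, 1) variables."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(d)) for _ in range(trials)]

def upper_tail(samples: list, r: float) -> float:
    """Empirical frequency of {Y - median > r}; the median is estimated from samples."""
    m = statistics.median(samples)
    return sum(1 for y in samples if y - m > r) / len(samples)

def bound(r: float) -> float:
    """Right side 2 exp(-r^2 / 4) of Theorem 2.47(a) with sigma = 1."""
    return 2.0 * math.exp(-r * r / 4.0)

if __name__ == "__main__":
    ys = simulate_max(20, 5000)
    for r in (0.5, 1.0, 2.0, 3.0):
        print(r, upper_tail(ys, r), bound(r))
```

In this example the empirical tail is much smaller than the bound, which is 
consistent with the remark that the constants in Theorem 2.43 are not the best 
possible. 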


2.10 Generic Chaining 


There are some differing definitions of subgaussian random variable or process. 
Here is a condition used by Talagrand (2005, (0.4)). 


Definition. Let $(S, d)$ be a metric space. A real-valued stochastic process 
$\{X_t\}_{t \in S}$ will be called T-subgaussian iff it is centered, i.e. $EX_t = 0$ for all 
$t \in S$, and for every $u > 0$ and $s \ne t$ in $S$ we have 

$$\Pr(|X_s - X_t| > u) \le 2\exp\left(-\frac{u^2}{2d(s, t)^2}\right). \tag{2.39}$$

Note that although (2.39) puts strong restrictions on the distributions of 
differences of values of the process, and the distributions of $X_t$ must satisfy 
$E|X_t| < \infty$ for all $t$ since $EX_t = 0$, nothing in the definition of T-subgaussian 
requires that $E(X_t^2) < \infty$ or that $E|X_t|^p < \infty$ for any $p > 1$. 

A centered Gaussian process is always T-subgaussian for the $L^2$ metric 
$d(s, t) = (E[(X_s - X_t)^2])^{1/2}$ if it is a metric, by Proposition 2.5(a); otherwise, 
for any $s \ne t$ such that $X_s = X_t$ a.s., the left side of (2.39) is 0, and we may 
say the inequality holds with $0 \le 2\exp(-\infty) = 0$. For any Hilbert space $H$ the 
isonormal process $L$ is T-subgaussian on $H$, or any subset, for the usual metric 
on $H$. 

A decomposition or partition of a set $S$ will mean a collection of finitely 
many disjoint nonempty sets whose union is $S$. If $(S, d)$ is a metric space and 
$A \subset S$, the diameter $\mathrm{diam}\, A$ is defined as $\sup_{s, t \in A} d(s, t)$. A sequence $\{\mathcal{A}_k\}_{k \ge 0}$ 
of partitions of $S$ will be called nested if for each $k = 1, 2, \ldots$, each set in 
$\mathcal{A}_k$ is a subset of some set in $\mathcal{A}_{k-1}$. A nested sequence of partitions $\mathcal{A}_k$ will 
be called a Talagrand sequence if $\mathcal{A}_0$ consists of the one set $S$, and for each 
$k = 1, 2, \ldots$, $\mathcal{A}_k$ contains $M(k)$ sets where 

$$1 \le M(k) \le N_k := 2^{2^k}. \tag{2.40}$$


If $X_t$ and $Y_t$ are stochastic processes with the same index set $S$ and on the 
same probability space, the two processes are said to be modifications of each 
other if for each $t$, $X_t = Y_t$ with probability 1. Thus $\{X_t\}_{t \in S}$ will be said to 
have a sample-bounded modification if it has a modification $\{Y_t\}_{t \in S}$ which is 
sample-bounded. Talagrand’s generic chaining method gives a characterization 
of sample boundedness for centered Gaussian processes. It can be compared 
to the facts based on covering or packing numbers (metric entropy) as in 
Theorems 2.14 and 2.36, which do not give characterizations but are perhaps 
more accessible. 

Given a sequence $\{\mathcal{A}_k\}_{k \ge 0}$ of partitions, a $t \in S$ and a $k$, let $A_k(t)$ be the set 
$A \in \mathcal{A}_k$ such that $t \in A$. 

For any (totally bounded) metric space $(S, d)$, define $\gamma_2(S, d)$ as the infimum 
over all Talagrand sequences $\{\mathcal{A}_k\}_{k \ge 0}$ of $\sup_{t \in S} \sum_{k \ge 0} 2^{k/2}\,\mathrm{diam}\, A_k(t)$. 
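
To make the definition concrete, here is a minimal sketch that evaluates 
$\sup_t \sum_k 2^{k/2}\,\mathrm{diam}\, A_k(t)$ for one particular Talagrand sequence on a 
finite set $S \subset [0, 1)$ with the usual metric; any such value is an upper bound 
for $\gamma_2(S, d)$, not $\gamma_2$ itself. The partitions chosen here (level 0 is $\{S\}$; 
level $k \ge 1$ groups points by the interval of length $2^{-2^k}$ containing 
them) and the function names are assumptions made for illustration. 

```python
import math

def cell_index(x: float, k: int) -> int:
    """Index of the level-k cell containing x in [0, 1): one cell at level 0,
    otherwise the interval [i / 2**(2**k), (i + 1) / 2**(2**k)) containing x."""
    if k == 0:
        return 0
    return int(x * 2 ** (2 ** k))

def chaining_sum(points, max_level: int) -> float:
    """sup over t of sum_{k <= max_level} 2**(k/2) * diam A_k(t).

    The series is truncated at max_level; this loses nothing once every
    level-max_level cell is a single point. Keep max_level small (say <= 8)."""
    diameters = []
    for k in range(max_level + 1):
        lo, hi = {}, {}
        for x in points:
            c = cell_index(x, k)
            lo[c] = min(lo.get(c, x), x)
            hi[c] = max(hi.get(c, x), x)
        # Cardinality condition (2.40); the cells are nested by construction.
        assert len(lo) <= 2 ** (2 ** k)
        diameters.append({c: hi[c] - lo[c] for c in lo})
    return max(
        sum(2 ** (k / 2) * diameters[k][cell_index(x, k)]
            for k in range(max_level + 1))
        for x in points
    )

if __name__ == "__main__":
    grid = [i / 256 for i in range(256)]
    print(chaining_sum(grid, 3))
```

For the grid of 256 equally spaced points, the cells at level 3 are single 
points, so only levels 0, 1 and 2 contribute, with diameters $255/256$, $63/256$ 
and $15/256$ respectively. 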


Theorem 2.48 If $(S, d)$ is a metric space such that $\gamma_2(S, d) < \infty$ and $\{X_t\}_{t \in S}$ is 
a T-subgaussian stochastic process, then it has a sample-bounded modification 
$\{Y_t\}_{t \in S}$. Moreover, $Y_t$ can be chosen such that 


E sup Y, < 7y(S, d). (2.41) 
t 


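Proof. Suppose $\gamma_2(S, d) < +\infty$. Let $\{\mathcal{A}_k\}_{k \ge 0}$ be a Talagrand sequence of
partitions of $S$ such that

\[ G := \sup_{t \in S} \sum_{k=0}^{\infty} 2^{k/2} \operatorname{diam} A_k(t) < \infty. \]

We have

\[ \Delta_k := \max_{A \in \mathcal{A}_k} \operatorname{diam} A \to 0 \tag{2.42} \]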


as $k \to \infty$, since if not, for some $\varepsilon > 0$, $\Delta_k \ge \varepsilon$, implying $G \ge 2^{k/2}\varepsilon \to +\infty$,
for infinitely many $k$, in fact for all $k$ since the partitions are nested and so $\Delta_k$ are
nonincreasing in $k$. Then $G = +\infty$ is a contradiction. Let $M(0) := 1$. For each
$k = 0, 1, \ldots$, form a set $T_k \subset S$ having $M(k)$ points by choosing just one point
from each of the $M(k)$ sets in $\mathcal{A}_k$. Since the partitions are nested, we can and
do choose $T_k$ such that $T_k \subset T_{k+1}$ for all $k$. For any $t \in S$, let $\pi_k(t)$ be the one
point in $T_k \cap A_k(t)$. For any $t \in S$, $u > 0$, and $k = 1, 2, \ldots$, we have by (2.39)

\[ \Pr\left\{ |X_{\pi_k(t)} - X_{\pi_{k-1}(t)}| > u 2^{k/2} d(\pi_k(t), \pi_{k-1}(t)) \right\} \le 2\exp\left(-u^2 2^{k-1}\right). \]

Let

\[ \Omega_u := \bigcap_{k \ge 1} \bigcap_{t \in S} \left\{ |X_{\pi_k(t)} - X_{\pi_{k-1}(t)}| \le u 2^{k/2} d(\pi_k(t), \pi_{k-1}(t)) \right\}. \]


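The number of possible pairs $(\pi_k(t), \pi_{k-1}(t))$ is at most $2^{2^k}$ by (2.40) since
$\pi_k(t)$ uniquely determines $\pi_{k-1}(t)$. Thus

\[ \Pr(\Omega_u^c) \le q(u) := \sum_{k=1}^{\infty} 2 \cdot 2^{2^k} \exp\left(-u^2 2^{k-1}\right). \]

Now $u^2 2^{k-1} \ge \tfrac12 u^2 + u^2 2^{k-2} \ge \tfrac12 u^2 + 2^{k+1}$ for $u \ge 3$, and then

\[ q(u) \le 2\exp(-u^2/2) \sum_{k=1}^{\infty} 2^{2^k} \exp\left(-2^{k+1}\right) \le \exp(-u^2/2). \tag{2.43} \]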
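The constant in the step leading to (2.43) can be checked numerically. The sketch below is only a sanity check under the stated inequality $u^2 2^{k-1} \ge u^2/2 + 2^{k+1}$ for $u \ge 3$; it evaluates the series in logarithmic form to avoid overflow, and the truncation at a dozen terms is an assumption of this sketch, harmless because the terms decay doubly exponentially.

```python
import math

def q(u, terms=12):
    """Partial sum of q(u) = sum_{k>=1} 2 * 2^(2^k) * exp(-u^2 * 2^(k-1))."""
    total = 0.0
    for k in range(1, terms + 1):
        log_term = math.log(2) + (2 ** k) * math.log(2) - u * u * 2 ** (k - 1)
        total += math.exp(log_term)
    return total

# Constant multiplying exp(-u^2/2) after the substitution valid for u >= 3:
# c = sum_{k>=1} 2 * 2^(2^k) * exp(-2^(k+1)).
c = sum(2 * math.exp((2 ** k) * math.log(2) - 2 ** (k + 1)) for k in range(1, 12))

# Ratios q(u) / exp(-u^2/2) at a few sample points u >= 3.
ratios = [q(u) / math.exp(-u * u / 2) for u in (3.0, 3.5, 4.0, 6.0)]
```

Since the constant `c` is well below 1, the bound $q(u) \le \exp(-u^2/2)$ holds with room to spare for $u \ge 3$.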
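For any $t \in S$ and $k \ge 1$, both $\pi_{k-1}(t) \in A_{k-1}(t)$ and $\pi_k(t) \in A_k(t) \subset
A_{k-1}(t)$ since the partitions are nested, so $d(\pi_k(t), \pi_{k-1}(t)) \le \operatorname{diam} A_{k-1}(t)$.
We have for any $t$ that

\[ \sum_{k=1}^{\infty} 2^{k/2} \operatorname{diam} A_{k-1}(t) = \sum_{k=0}^{\infty} 2^{(k+1)/2} \operatorname{diam} A_k(t) \le \sqrt{2}\, G < 2G. \]

Thus for any $u > 0$ and $\omega \in \Omega_u$, for all $t \in S$,

\[ \sum_{k=1}^{\infty} |X_{\pi_k(t)} - X_{\pi_{k-1}(t)}| \le 2uG. \tag{2.44} \]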


As $u \to +\infty$, $\Pr(\Omega_u^c) \to 0$ by (2.43), so almost every $\omega$ is in some $\Omega_u$, and the
telescoping series

\[ Y_t := X_{\pi_0(t)} + \sum_{k=1}^{\infty} \left( X_{\pi_k(t)} - X_{\pi_{k-1}(t)} \right) = \lim_{k \to \infty} X_{\pi_k(t)} \]

converges almost surely. More precisely, define $Y_t$ as the limit if it exists and 0
otherwise, so it is a measurable function of the countable set of random variables
$X_s$ for $s \in T_k$ for all $k$. If $t \in T_k$ for some $k$, then since the sets $T_j$ are nested we
have $\pi_j(t) = t$ for all $j \ge k$ and simply $Y_t = X_t$. Condition (2.39) implies that
the process $t \mapsto X_t(\cdot)$ is continuous in probability, and thus by (2.42) for each
$t$, $Y_t = X_t$ a.s., so $Y_t$ is a modification of $X_t$. Here $\pi_0(t)$ is the single point of $T_0$, written $\pi_0$. We have $\sup_t |Y_t - X_{\pi_0}| \le 2uG$
for all $\omega \in \Omega_u$, so $\sup_t |Y_t| \le |X_{\pi_0}| + 2uG < \infty$ a.s., proving that $\{Y_t\}_{t \in S}$ is
sample-bounded and $\{X_t\}_{t \in S}$ has a sample-bounded modification as claimed.

Next, let $\eta(\omega) := \sup_{t \in S} Y_t$, which is measurable because the sup can be
restricted to the countable set $\bigcup_k T_k$. We have $E\eta = E(\eta - X_{\pi_0})$ since $EX_{\pi_0} = 0$
(the process is centered by definition of T-subgaussian) and

\[ 0 \le \eta - X_{\pi_0} = \sup_{t \in S} (Y_t - X_{\pi_0}) \le 2uG \]

for each $\omega \in \Omega_u$ by (2.44). Let $Z := (\eta - X_{\pi_0})/(2G)$. Then by (2.43), since
$Z \ge 0$,

\[ EZ = \int_0^{\infty} \Pr(Z > u)\,du \le 3 + \int_3^{\infty} \exp(-u^2/2)\,du < 3.1, \]

so $E\eta \le 6.2G$. Varying partitions and letting $G \downarrow \gamma_2(S, d)$, (2.41) follows,
proving Theorem 2.48.
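The numerical constants at the end of the proof can be confirmed with the complementary error function. This is a small check, not part of the argument; it uses the identity $\int_a^\infty e^{-u^2/2}\,du = \sqrt{\pi/2}\,\operatorname{erfc}(a/\sqrt{2})$, and the variable names are chosen only for this sketch.

```python
import math

def gaussian_tail_integral(a):
    """Integral of exp(-u^2/2) over [a, infinity), via erfc."""
    return math.sqrt(math.pi / 2) * math.erfc(a / math.sqrt(2))

# E Z <= 3 + integral from 3 to infinity of exp(-u^2/2) du
bound_EZ = 3 + gaussian_tail_integral(3.0)

# E eta = 2 G E Z, so the factor multiplying G is 2 * bound_EZ
bound_constant = 2 * bound_EZ
```

The resulting factor is below 6.2, hence below the constant 7 stated in (2.41).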


For the isonormal process, a converse holds: 


Theorem 2.49 A set $C \subset H$ for a Hilbert space $H$ is a GB-set in $H$ if and only
if $\gamma_2(C, d) < \infty$ for the usual metric $d$ on $H$.


Proof. For "if," the isonormal process $L$ is T-subgaussian, so Theorem 2.48
applies. For "only if," suppose $C$ is a GB-set in $H$. The first step is the following,
related to the Sudakov minoration. As usual for $x \in H$ and $r > 0$,
$B(x, r) := \{y : \|y - x\| < r\}$.


Proposition 2.50 There are finite, absolute constants $L_1$ and $L_2$ satisfying the
following. Let $a > 0$ and let $t_i$ for $i = 1, \ldots, m$ be points of $H$. Assume that
$\|t_i - t_{i'}\| \ge a$ for $i \ne i'$. Let $\sigma > 0$ and for each $i = 1, \ldots, m$ let $H_i$ be a
nonempty set included in $B(t_i, \sigma)$ and $J := \bigcup_i H_i$. Then

\[ EL(J)^* \ge \frac{a}{L_1} \sqrt{\log m} - L_2 \sigma \sqrt{\log m} + \min_{1 \le i \le m} EL(H_i)^*. \tag{2.45} \]

Remark. If $\sigma \le a/(2L_1 L_2)$ then (2.45) implies

\[ EL(J)^* \ge \frac{a}{2L_1} \sqrt{\log m} + \min_{1 \le i \le m} EL(H_i)^*. \tag{2.46} \]


Proof. We can assume that $m \ge 2$. For each $i \le m$ let

\[ Y_i := L(H_i)^* - L(t_i). \]

For each $t \in H_i$, $\|t - t_i\| < 2\sigma$, and so for each $u \ge C$, by Theorem 2.47(b),

\[ \Pr(|Y_i - EY_i| \ge 2\sigma u) \le 4\exp(-(u - C)^2/4). \]

Letting $V := \max_{1 \le i \le m} |Y_i - EY_i|$ we then have for each $u \ge C$

\[ \Pr(V \ge 2\sigma u) \le 4m \exp(-(u - C)^2/4). \tag{2.47} \]

By the identity (2.9) we then have

\[ EV/(2\sigma) \le C + \int_C^{\infty} \min\left(1, 4m \exp(-(u - C)^2/4)\right) du. \]

Letting $u_0 := C + 2\sqrt{\ln(4m)}$, the minimum is $< 1$ if and only if $u > u_0$, and
the integral from $u_0$ to $+\infty$ is $< 1$, so for some absolute constant $L_2$ and all
$m \ge 2$,

\[ EV \le \sigma\left(4C + 2 + 4\sqrt{\ln(4m)}\right) \le L_2 \sigma \sqrt{\ln m}. \tag{2.48} \]
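The claim that the integral from $u_0$ to $+\infty$ is less than 1 can be checked in closed form. After the shift $v = u - C$ the constant $C$ drops out, and $\int_{v_0}^\infty 4m\,e^{-v^2/4}\,dv = 4m\sqrt{\pi}\,\operatorname{erfc}(v_0/2)$ with $v_0 = 2\sqrt{\ln(4m)}$. The sketch below evaluates this for a few sample values of $m$ (the sample values are arbitrary) and compares it with the elementary tail bound $1/\sqrt{\ln(4m)}$.

```python
import math

def tail(m):
    """Integral of 4m*exp(-v^2/4) over v >= 2*sqrt(ln(4m)), in closed form."""
    v0 = 2 * math.sqrt(math.log(4 * m))
    return 4 * m * math.sqrt(math.pi) * math.erfc(v0 / 2)

def elementary_bound(m):
    """Bound from integrating (v/v0)*exp(-v^2/4) instead: equals 1/sqrt(ln(4m))."""
    return 1 / math.sqrt(math.log(4 * m))

values = {m: tail(m) for m in (2, 3, 10, 1000, 10 ** 6)}
```

For every $m \ge 2$ the elementary bound is already below 1, which is all the proof uses.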


For each $i = 1, \ldots, m$, $Y_i \ge EY_i - V \ge \min_{1 \le j \le m} EY_j - V$. So

\[ L(H_i)^* = Y_i + L(t_i) \ge L(t_i) + \min_{1 \le j \le m} EY_j - V \]

and

\[ L(J)^* \ge \max_{1 \le i \le m} L(t_i) + \min_{1 \le j \le m} EY_j - V. \]

Then, take expectations of both sides, apply the Sudakov minoration in expectation
form, Theorem 2.22, to the first term on the right, use $EL(t_i) = 0$ for each
$i$ in the second, and apply (2.48) to the third, proving the proposition.


To continue the proof of "only if" in Theorem 2.49, recall that $N_n = 2^{2^n}$ and
that $\Delta(A)$ is the diameter of a set $A$. For any set $A \subset H$ let $F_0(A) := EL(A)^*$,
which for $A$ finite or countable is the same as $E \sup_{t \in A} L(t)$. For $L_1$ and $L_2$ as
in Proposition 2.50 let $r := \max(4, 2L_1 L_2)$. We have the following fact, which
produces partitions nearly as good as those desired, and combined with a short
argument following it will lead to the proof.


Theorem 2.51 For each $n = 0, 1, 2, \ldots$, let $m := N_{n+1}$ and $\theta(n) := 2^{n/2}$. Let
$T$ be a GB-set in $H$ and assume that for any $s \in T$, any $a$ with $0 < a < r\Delta(T)$,
and any $t_1, \ldots, t_m$ in $B(s, ar)$ such that $\|t_i - t_{i'}\| \ge a$ whenever $i \ne i'$, and
any nonempty sets $H_i$ such that $H_i \subset B(t_i, a/r)$, we have

\[ F_0\Big( \bigcup_{1 \le i \le m} H_i \Big) \ge a\theta(n+1) + \min_{1 \le i \le m} F_0(H_i). \tag{2.49} \]

Then there exists a nested sequence $\{\mathcal{A}_k\}_{k \ge 0}$ of partitions of $T$ with $\operatorname{card}(\mathcal{A}_n) \le
N_{n+1}$ for each $n$ such that for a universal constant $K$,

\[ \sup_{t \in T} \sum_{k=0}^{\infty} 2^{k/2} \Delta(A_k(t)) \le K r \left( F_0(T) + \Delta(T) \right). \tag{2.50} \]


Proof. The hypothesis will be applied only to $a$ of the form $r^{-j-1}$ where $j$ is in
the set $\mathbb{Z}$ of integers (positive, negative, or 0), so that in (2.49), $a/r = r^{-j-2}$.

The sequence $\{\mathcal{A}_n\}$ of partitions will be defined recursively. For each $n$
and each $C \in \mathcal{A}_n$, a point $t_C \in C$, an integer $j(C) \in \mathbb{Z}$ and numbers $b_\ell(C)$
for $\ell = 0$, 1, and 2 will be defined, and it will be shown that the following
properties hold:


\[ C \subset B(t_C, r^{-j(C)}), \tag{2.51} \]

so that $\Delta(C) \le 2r^{-j(C)}$,

\[ F_0(C) \le b_0(C), \tag{2.52} \]

and for all $t \in C$,

\[ F_0(C \cap B(t, r^{-j(C)-1})) \le b_1(C) \tag{2.53} \]

and

\[ F_0(C \cap B(t, r^{-j(C)-2})) \le b_2(C). \tag{2.54} \]

It will also be shown in all cases that

\[ b_1(C) \le b_0(C) \tag{2.55} \]

and for each $C \in \mathcal{A}_n$, letting $\varepsilon_n := F_0(T)/2^n$, that

\[ b_0(C) - r^{-j(C)-1}\theta(n) \le b_2(C) \le b_0(C) + \varepsilon_n. \tag{2.56} \]

Last but not least it will be shown for each $n \ge 0$, each $C \in \mathcal{A}_n$, and each
$A \in \mathcal{A}_{n+1}$ with $A \subset C$ that

\[ \sum_{\ell=0}^{2} b_\ell(A) + (1 - 2^{-1/2})\, r^{-j(A)-1}\theta(n+1)
   \le \sum_{\ell=0}^{2} b_\ell(C) + \tfrac12 (1 - 2^{-1/2})\, r^{-j(C)-1}\theta(n) + \varepsilon_{n+1}. \tag{2.57} \]

For $n = 0$, define $\mathcal{A}_0 := \{T\}$, $b_0(T) := b_1(T) := b_2(T) := F_0(T)$ and choose any
$t_T \in T$. Let $j(T)$ be the largest integer such that $T \subset B(t_T, r^{-j(T)})$. It follows
that

\[ r^{-j(T)-1} \le \Delta(T) \le 2r^{-j(T)}. \tag{2.58} \]

For $n = 0$, (2.51) through (2.56) all clearly hold and (2.57) does not yet apply.
The cardinality of $\mathcal{A}_0$ is $1 < N_1 = 4$.

Now for the recursion-induction step, suppose that for a given $n = 0, 1, \ldots$,
a partition $\mathcal{A}_n$ with cardinality at most $N_{n+1}$ and the points $t_A$ and numbers $j(A)$
and $b_\ell(A)$ have been defined with all the given properties (2.51) through (2.57)
holding. We want to define a partition $\mathcal{A}_{n+1}$ so that all properties continue
to hold and the sequence of partitions is nested. Each set $C \in \mathcal{A}_n$ will be
decomposed into at most $m := N_{n+1}$ sets in $\mathcal{A}_{n+1}$, so since $N_{n+1}^2 = N_{n+2}$, the
desired bound on the cardinality of $\mathcal{A}_{n+1}$ will hold.

Let $D_0 := C$ and $j := j(C)$. Choose $t_1 \in C$ such that

\[ F_0(C \cap B(t_1, r^{-j-2})) \ge \sup_{t \in C} F_0(C \cap B(t, r^{-j-2})) - \varepsilon_{n+1}. \tag{2.59} \]

Set $A_1 := C \cap B(t_1, r^{-j-1})$.
For $1 \le i \le m - 1$, if $t_\ell$ and $A_\ell$ for $\ell = 1, \ldots, i$ have been defined, let
$D_i := C \setminus \bigcup_{\ell \le i} A_\ell$. If $D_i$ is empty, the decomposition of $C$ is finished. If
not, choose $t_{i+1} \in D_i$ such that

\[ F_0(D_i \cap B(t_{i+1}, r^{-j-2})) \ge \sup_{t \in D_i} F_0(D_i \cap B(t, r^{-j-2})) - \varepsilon_{n+1}. \tag{2.60} \]

Let $A_{i+1} := D_i \cap B(t_{i+1}, r^{-j-1})$. If eventually $A_{m-1} \ne \emptyset$ is defined, then let
$D_{m-1} := C \setminus \bigcup_{i<m} A_i$. If $D_{m-1}$ is empty the decomposition of $C$ is finished,
otherwise let $A_m := D_{m-1}$. Thus $C$ is decomposed into at most $m = N_{n+1}$ sets
$A_i$ as desired.
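The decomposition step can be sketched for a finite set of points. This is an illustration only: the functional `F` below is a hypothetical monotone stand-in (the range of a set of reals), not the Gaussian functional $F_0$, and on a finite set the supremum is attained, so the slack $\varepsilon_{n+1}$ is taken to be 0. Centers are selected with the small radius $r^{-j-2}$ and pieces are cut out with the larger radius $r^{-j-1}$, as in (2.59) and (2.60).

```python
def decompose(C, dist, F, r, j, m):
    """Split the finite set C into at most m pieces as in the recursion step.

    Pieces 1..m-1 are D_{i-1} ∩ B(t_i, r^(-j-1)), where t_i maximizes
    F(D_{i-1} ∩ B(t, r^(-j-2))) over t in D_{i-1}; the last piece is the rest.
    """
    small, large = r ** (-j - 2), r ** (-j - 1)

    def ball(D, t, rad):
        # open ball of radius rad around t, intersected with D
        return frozenset(s for s in D if dist(s, t) < rad)

    D, pieces, centers = frozenset(C), [], []
    while D and len(pieces) < m - 1:
        t = max(sorted(D), key=lambda s: F(ball(D, s, small)))
        A = ball(D, t, large)
        pieces.append(A)
        centers.append(t)
        D = D - A
    if D:
        pieces.append(D)  # A_m := D_{m-1}, the remainder
    return pieces, centers

# Hypothetical data: points on the line, F = range of the set (a stand-in for F_0).
C = [0.0, 0.1, 0.2, 1.0, 1.1, 5.0]
dist = lambda s, t: abs(s - t)
F = lambda A: (max(A) - min(A)) if A else 0.0
pieces, centers = decompose(C, dist, F, r=4, j=-1, m=4)
```

Because each new center is chosen outside the balls already removed, the centers are at least $r^{-j-1}$ apart, which is the separation needed to apply (2.49).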
Let $A$ be a set in this decomposition. If $A = A_m$ then define $j(A) := j =
j(C)$, $t_A := t_C$, $b_0(A) := b_0(C)$, $b_1(A) := b_1(C)$, and

\[ b_2(A) := b_0(C) - r^{-j-1}\theta(n+1) + \varepsilon_{n+1}. \]

Then (2.51), (2.52), (2.53), and (2.55) hold for $A$ since they did for $C$ by
induction hypothesis. Writing (2.56) for $A$ in place of $C$ and $n + 1$ in place of
$n$, it holds by definition of $b_2(A)$.

To prove (2.54) for $A = A_m$, take any point $t \in A$ and call it $t_m$. For $1 \le
i \le m$, we have $t_i \in D_{i-1}$ by the definitions. It follows that if $i' < i$ then
$d(t_i, t_{i'}) \ge r^{-j-1}$. Then by (2.49) for $a = r^{-j-1}$, $s = t_C$ using (2.51), and $H_i :=
D_i \cap B(t_{i+1}, r^{-j-2})$ it follows that

\[ F_0(C) \ge r^{-j-1}\theta(n+1) + \min_{0 \le i \le m-1} F_0(D_i \cap B(t_{i+1}, r^{-j-2})). \tag{2.61} \]

By (2.60), since $t_m \in D_i$ for $1 \le i \le m - 1$ and then since $A \subset D_i$, we have

\[ F_0(D_i \cap B(t_{i+1}, r^{-j-2})) \ge F_0(D_i \cap B(t_m, r^{-j-2})) - \varepsilon_{n+1}
   \ge F_0(A \cap B(t_m, r^{-j-2})) - \varepsilon_{n+1}. \]

Since $F_0(C) \le b_0(C)$ by induction hypothesis (2.52), it follows from the
preceding display and (2.61) that

\[ b_0(C) \ge r^{-j-1}\theta(n+1) - \varepsilon_{n+1} + F_0(A \cap B(t_m, r^{-j-2})). \]

Recalling that $t_m$ is an arbitrary point of $A = A_m$, and the definition of $b_2(A)$,
(2.54) is proved in this case.
To prove (2.57), by the definitions including $\theta(n) := 2^{n/2}$ we have

\[ \sum_{\ell=0}^{2} b_\ell(A) + (1 - 2^{-1/2})\, r^{-j-1}\theta(n+1)
   = 2b_0(C) + b_1(C) - 2^{-1/2} r^{-j-1}\theta(n+1) + \varepsilon_{n+1}
   \le 2b_0(C) + b_1(C) - r^{-j-1}\theta(n) + \varepsilon_{n+1}. \tag{2.62} \]

By (2.56) we have $b_0(C) \le b_2(C) + r^{-j-1}\theta(n)$, and so (2.57) follows.
The case $A = A_m$ is now finished. Let $A = A_i$ for some $i < m$. Define
$j(A) := j + 1$ and $t_A := t_i$, so that

\[ A = A_i \subset B(t_i, r^{-j-1}) = B(t_A, r^{-j(A)}) \]

and (2.51) holds for $A$. Define

\[ b_0(A) := b_2(A) := b_1(C), \qquad b_1(A) := \min(b_1(C), b_2(C)). \]

The conditions (2.55) and (2.56) for $A$ are immediate. To prove (2.52) for $A$,
we have in view of (2.53) for $C$

\[ F_0(A) \le F_0(C \cap B(t_i, r^{-j-1})) \le b_1(C) = b_0(A). \]

To prove (2.53) for $A$, using (2.54) for $C$, for any $t \in A$,

\[ F_0(A \cap B(t, r^{-j(A)-1})) \le F_0(C \cap B(t, r^{-j-2})) \le \min(b_1(C), b_2(C)) = b_1(A). \]

Relation (2.54) for $A$ follows from (2.52) for $A$ because $b_2(A) = b_0(A)$.
To prove (2.57) for $A$, we have

\[ \sum_{\ell=0}^{2} b_\ell(A) \le 2b_1(C) + b_2(C) \le \sum_{\ell=0}^{2} b_\ell(C) \tag{2.63} \]

since $b_1(C) \le b_0(C)$ by (2.55) for it. We have $j(A) = j(C) + 1$, and since
$r^{-1}\theta(n+1) \le \theta(n)/2$ because $r \ge 4$, we have

\[ r^{-j(A)-1}\theta(n+1) \le \tfrac12\, r^{-j(C)-1}\theta(n), \]

which together with (2.63) proves (2.57) for $A$.

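The inductive definition and proof of the given properties of the partitions
$\mathcal{A}_n$ and associated points and numbers is now complete, and it remains to prove
the conclusion (2.50). By (2.57) for each $t \in T$ and any $n = 0, 1, \ldots$, setting
$j_n(t) := j(A_n(t))$,

\[ \sum_{\ell=0}^{2} b_\ell(A_{n+1}(t)) + (1 - 2^{-1/2})\, r^{-j_{n+1}(t)-1}\theta(n+1)
   \le \sum_{\ell=0}^{2} b_\ell(A_n(t)) + \tfrac12 (1 - 2^{-1/2})\, r^{-j_n(t)-1}\theta(n) + \varepsilon_{n+1}. \]

We have $b_\ell(T) = F_0(T)$ for each $\ell$, and since each $b_\ell(A) \ge 0$ by (2.52) through
(2.54), summing the previous relations from $n = 0$ to $q$, where the sum of the
$\varepsilon_{n+1}$ is at most $F_0(T)$, gives

\[ (1 - 2^{-1/2}) \sum_{n=0}^{q} r^{-j_{n+1}(t)-1}\theta(n+1)
   \le 4F_0(T) + \tfrac12 (1 - 2^{-1/2}) \sum_{n=0}^{q} r^{-j_n(t)-1}\theta(n), \tag{2.64} \]

and so, since the term for $n + 1 = q + 1$ is $\ge 0$,

\[ \tfrac12 (1 - 2^{-1/2}) \sum_{n=0}^{q} r^{-j_n(t)-1}\theta(n) \le 4F_0(T) + (1 - 2^{-1/2})\, r^{-j_0(t)-1}\theta(0). \]

By (2.51), $\Delta(A_n(t)) \le 2r^{-j_n(t)}$ and by (2.58), $r^{-j(T)-1} \le \Delta(T)$, where $j_0(t) = j(T)$. Thus (2.50)
follows, and Theorem 2.51 is proved.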


To finish the proof of "only if" in Theorem 2.49, note that Proposition 2.50
implies that the hypotheses of Theorem 2.51 hold. The cardinalities of the $\mathcal{A}_n$
given are bounded by $N_{n+1}$ rather than $N_n$ as one could like. To remedy this,
define a sequence $\{\mathcal{B}_n\}$ of partitions by $\mathcal{B}_n := \{T\}$ for $n = 0$ and $\mathcal{B}_n := \mathcal{A}_{n-1}$
for $n \ge 1$ (so that $\mathcal{B}_1 = \{T\}$ also). Then $\mathcal{B}_n$ has cardinality $\le N_n$ for each $n$ and
$\{\mathcal{B}_n\}$ is a Talagrand sequence. We have

\[ \sum_{n=1}^{\infty} 2^{n/2} \Delta(B_n(t)) = \sqrt{2} \sum_{k=0}^{\infty} 2^{k/2} \Delta(A_k(t)), \]

and the supremum of the latter over all $t$ is finite by Theorem 2.51, while
$\Delta(B_0(t)) = \Delta(T)$ is bounded uniformly in $t$, so $\gamma_2(T, d) < \infty$ and the proof
of Theorem 2.49 is complete.


Next, (2.41) implies the following relation:

\[ E \sup_{s,t \in S} |Y_s - Y_t| = E\Big[ \sup_{t \in S} Y_t - \inf_{s \in S} Y_s \Big] = 2E \sup_{t \in S} Y_t \le 14\gamma_2(S, d). \tag{2.65} \]
Now to relate the above chaining method to metric entropy, recall that for
$\varepsilon > 0$ and $A \subset S$, $N(\varepsilon, A) := N(\varepsilon, A, d)$ is the minimum number of sets of
diameter at most $2\varepsilon$ that cover $A$. Theorem 2.36 states that if $C \subset H$, a Hilbert
space, with

\[ \int_0^{\infty} \sqrt{\log N(t, C)}\,dt < \infty, \tag{2.66} \]

then $C$ is a GC-set (and hence a GB-set). Suppose that

\[ \int_0^{\infty} \sqrt{\log N(t, S)}\,dt < \infty. \]

We can assume that $S$ is infinite since otherwise there is no problem about
sample boundedness or uniform continuity. Then $N(\varepsilon, S) \to +\infty$ as $\varepsilon \downarrow 0$.
Also, (2.66) (with $S$ in place of $C$) implies that $(S, d)$ is totally bounded, since
otherwise $N(t, S) = +\infty$ for all small enough $t > 0$, so $N(t, S) < +\infty$ for all $t > 0$. For $k =
1, 2, \ldots$, let

\[ \varepsilon_k := \inf\{\varepsilon > 0 : N(\varepsilon, S) \le 2^{2^{k-1}}\}. \tag{2.67} \]

Thus there is a collection of at most $2^{2^{k-1}}$ sets $B_j$ of diameters at most $3\varepsilon_k$
which cover $S$. Define a partition $\mathcal{C}_k$ consisting of all those sets $B_j \setminus \bigcup_{i<j} B_i$
which are nonempty, which also contains at most $2^{2^{k-1}}$ sets, each with diameter
at most $3\varepsilon_k$.

Now recursively we define a nested sequence of partitions as follows. Let
$\mathcal{A}_0 = \{S\}$. Given a partition $\mathcal{A}_k$ for $k \ge 0$, containing at most $2^{2^k}$ sets, let $\mathcal{A}_{k+1}$
consist of all nonempty intersections of a set in $\mathcal{A}_k$ and a set in $\mathcal{C}_{k+1}$. Then $\mathcal{A}_{k+1}$
is a partition, consists of sets with diameter at most $3\varepsilon_{k+1}$, and has in it at most

\[ 2^{2^k} \cdot 2^{2^k} = 2^{2^{k+1}} \]

sets, so the recursion can continue and $\{\mathcal{A}_k\}_{k \ge 0}$ is a nested, Talagrand sequence
of partitions of $S$.
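The recursive refinement can be sketched on a finite set. In this illustration the partitions playing the role of the cover-derived partitions are supplied by hand (hypothetical data) rather than computed from covering numbers; the sketch only shows the common-refinement step and checks the cardinality bound $2^{2^k}$ and nestedness.

```python
def refine(prev, cover_partition):
    """All nonempty intersections of a set in prev with a set in cover_partition."""
    return [A & B for A in prev for B in cover_partition if A & B]

def nested_sequence(S, cover_partitions):
    """A_0 = {S}; A_{k+1} is the common refinement of A_k and C_{k+1}."""
    seq = [[frozenset(S)]]
    for Ck in cover_partitions:
        seq.append(refine(seq[-1], Ck))
    return seq

# Hypothetical example on S = {0,...,7}: C_1 has 2 sets, C_2 has 2 sets.
S = range(8)
C1 = [frozenset({0, 1, 2, 3}), frozenset({4, 5, 6, 7})]
C2 = [frozenset({0, 1, 4, 5}), frozenset({2, 3, 6, 7})]
seq = nested_sequence(S, [C1, C2])
sizes = [len(part) for part in seq]

# Each A_k has at most 2^(2^k) sets, and each set of A_{k+1} lies in a set of A_k.
cardinality_ok = all(len(seq[k]) <= 2 ** (2 ** k) for k in range(len(seq)))
nested_ok = all(any(A <= B for B in seq[k]) for k in range(len(seq) - 1)
                for A in seq[k + 1])
```

Intersecting with the previous partition is what makes the sequence nested; the price is that cardinalities multiply, which is why the covers at stage $k + 1$ are allowed only $2^{2^k}$ sets.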


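We have $\operatorname{diam} A_k(t) \le 3\varepsilon_k$ for all $k \ge 1$ and all $t$. Clearly $\varepsilon_k$ is a nonincreasing
sequence as $k$ increases. The integral of $\sqrt{\log N(t, S)}$ over $(0, \infty)$ is bounded below by

\[ \sqrt{\log 2} \sum_{k=1}^{\infty} (\varepsilon_k - \varepsilon_{k+1})\, 2^{(k-1)/2}
   \ge \sqrt{\log 2} \sum_{k=2}^{\infty} \varepsilon_k \left[ 2^{(k-1)/2} - 2^{(k-2)/2} \right]
   = \sqrt{\log 2} \left( \frac{1}{\sqrt{2}} - \frac12 \right) \sum_{k=2}^{\infty} 2^{k/2} \varepsilon_k, \]

which implies that $\gamma_2(S, d) < \infty$, and so, gives that a set $C \subset H$ satisfying
(2.66) is a GB-set.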

In the above argument, the diameters of all sets in $\mathcal{A}_k$ were bounded by
the same bound, in this case $3\varepsilon_k$, as is required by methods based on covering
or packing numbers (metric entropy). The generic chaining method is more
flexible in that it allows the diameters of different sets in the partition $\mathcal{A}_k$ to be
quite different.


2.11 Homogeneous and Quasi-homogeneous Sets in H 


We know that there are GB-sets in a Hilbert space that are not GC-sets, for
example, if $\{e_n\}_{n \ge 1}$ is an orthonormal basis, the set $\{e_n/\sqrt{\log n}\}_{n \ge 2}$; or, to get a
compact set, that sequence together with 0. The non-GC property of the set is
local around 0, as outside of any neighborhood of 0, the set is finite. Thus in a
sense the set is highly inhomogeneous. For more homogeneous sets in a sense
to be defined, the GB- and GC-properties are equivalent:


Theorem 2.52 Let $T \subset H$. Suppose there exists a law (Borel probability
measure) $\mu$ on $T$ such that for some $M < \infty$, $f(r) := \sup_{x \in T} \mu(B(x, r)) \le
M \inf_{y \in T} \mu(B(y, r))$ for all $r > 0$. Then the following are equivalent:
(i) $T$ is a GC-set;
(ii) $T$ is a GB-set;
(iii) $\int_0^{\infty} (\log D(\varepsilon, T))^{1/2}\,d\varepsilon < \infty$.

Proof. From the definitions, (i) always implies (ii). We know that (iii) always
implies (i) by Theorem 2.36. It remains to show that (ii) implies (iii). It follows
from (ii) that $T$ is totally bounded, by the Sudakov minoration Theorem 2.14.

We can assume that $T$ is infinite. It follows that as $r \downarrow 0$, we have $f(r)/M \downarrow 0$
and so $f(r) \downarrow 0$. We can assume that $L_1$ and $L_2$ in Proposition 2.50 are
both $\ge 2$. Let $\rho := 1/(3L_1 L_2)$.

For any $r > 0$ there are $D(r, T)$ points $x_i$ in $T$ more than $r$ apart. Thus the
balls $B(x_i, r/2)$ are disjoint and $D(r, T) f(r/2)/M \le 1$.

The integral in (iii) can be restricted to the range (0,diam(T)]. It will suffice 
to show that 


I := { ” /iox(M [fr] dr < 00. (2.68) 
0 


We have I = )°°° Ij where 


p ; 
Ij := / Vlog(M/f(r/2))dr < uj i= p!J/log(1/f(p/t!/2)). (2.69) 
pitl 


Let jı := 1 and recursively, given ją for some k > 1, let jx) be the least j 
such that f(p//2) < f(p*)*/M?. Let 


Jk j< j+ 
If jk+1 = jk + 1, then simply vk = u j. If jk+1 = jk + 2 then 


jk+ı—2 
Uk = Ujy-1 + >» Uj < Ver +v 
J=jk 


where 


vir := p% jlog (1/44 /2) and v2 := p*\/2log(1/0+!/2). (2.70) 


To apply Proposition 2.50, consider a ball A := B(x, p*) for any x € T and 
k = 1,2, .... For the metric space (A, d) (where d is the usual metric on H, 
restricted to A) and € > 0 let 


D>(e, A) := sup{m : forsome ti, ..., tm € A, d(ti, tj) =e for i Æ j}. 


Let a := p++! /2 and m := Ds(a, A). Take t; from the definition. Then the 
open balls B(t;, a) cover A, so 


f (0%) /M < WA) < mf (p**172), 


Thus by definition of jx+1, 


m > f (o*) / (Mf (P™*/2)) > If] f (p112). 
Leto := př < pa < a/(2LıL2) and H; := B(t;, o) fori = 1,...,m. Then 
Proposition 2.50 and (2.46) apply and give 


EL(B (x Ape * tog 
, ae ee f (o+! /2) 


Iterating this with k replaced by k + 4 and x by any t; gives 


) tnin EL (B (t, P=). 


[e9] 
; 1 
too > WL ELT = X'p» iog (aaa) 


s=0 


for u = 2,3, 4,5, and therefore 


a 1 
Xot log (a) < +00. 


w=2 


It follows that ri Uki + vg2 < 00 for v4; defined by (2.70), and so (iii) holds, 
completing the proof. 


Corollary 2.53 (Fernique) Let $(T, d)$ be a metric space such that there is a
group $G$ of 1-1 transformations $g$ of $T$ onto itself for which

(a) $d$ is $G$-invariant: for all $s, t \in T$ and $g \in G$,
$d(g(s), g(t)) = d(s, t)$;
(b) There is a law (Borel probability measure) $\mu$ on $T$ which is $G$-invariant,
i.e. $\mu \circ g^{-1} = \mu$ for all $g \in G$;
(c) $G$ acts transitively on $T$: for all $s, t \in T$ there is a $g \in G$ with $g(s) = t$.
Then the hypotheses and thus the conclusion of Theorem 2.52 hold.


Proof. The hypotheses imply that for each $r > 0$, $\mu(B(x, r))$ is the same for
all $x \in T$. Thus Theorem 2.52 applies for any $M \ge 1$.


For examples of the situation in Corollary 2.53 consider the following. A
topological group is a group $G$ with a topology under which the group operation
$(g, h) \mapsto gh$ is jointly continuous from $G \times G$ into $G$ and the inverse $g \mapsto g^{-1}$
is continuous from $G$ into $G$. Only Hausdorff topologies will be considered. A
reference for the following is Nachbin (1965). If $G$ is locally compact, there
exist so-called left and right Haar measures $\mu_l$ and $\mu_r$, which are strictly
positive on all nonempty open sets, finite on all compact sets, and such that
for every Borel set $A \subset G$ and $g \in G$, $\mu_l(gA) = \mu_l(A)$ and $\mu_r(Ag) = \mu_r(A)$.
Each of $\mu_l$ and $\mu_r$ is unique up to a positive multiplicative constant. $G$ is called
unimodular if the left and right Haar measures coincide. All compact groups
are unimodular (Nachbin 1965, Chapter 2, Proposition 13, p. 81). Thus, for
every compact Hausdorff topological group $G$, there is a probability measure
$\mu$ on the Borel sets which is both left and right invariant and is unique with
either property. It will be called the Haar measure on $G$.

Let $E$ be a Hausdorff topological space and $G$ a topological group. Then $E$
is said to be a homogeneous space under $G$ if we have a jointly continuous map
$(g, x) \mapsto gx$ from $G \times E$ onto $E$ such that $(gh)x = g(hx)$ for all $g, h \in G$ and
$x \in E$, $ex = x$ for all $x \in E$ where $e$ is the identity element of $G$, and such that
for every $x, x' \in E$ there is some $g \in G$ such that $gx = x'$.

If $G$ is compact then $E$ is necessarily also compact. In that case there
is a unique probability measure $m$ on the Borel sets of $E$ such that $m$ is
$G$-invariant, meaning that for each $g \in G$, the map $x \mapsto gx$ of $E$ to itself preserves
$m$ (Nachbin 1965, Chapter 3, Theorem 1 p. 138, Corollary 4, p. 140).

If $G$ is a compact, metrizable group, then there exists a metric $d$ on $G$ which
is two-sided invariant, meaning that for any $g, h, j \in G$ we have $d(g, h) =
d(jg, jh) = d(gj, hj)$ (Hewitt and Ross 1979, Theorem 8.6, p. 71). If $E$ is
a homogeneous space under $G$, then for any $x, y \in E$ we know that $G_{x,y} :=
\{g \in G : gx = y\}$ is nonempty. Let $\rho_d(x, y) := \inf\{d(g, e) : g \in G_{x,y}\}$. The
infimum is actually attained, by compactness and joint continuity. Clearly $\rho_d$ is
nonnegative. By the joint continuity, $G_{x,y}$ is closed, and for $x \ne y$ it does not
contain $e$. It follows that $\rho_d(x, y) > 0$. We have $\rho_d(x, y) = \rho_d(y, x)$ because by
the two-sided invariance of $d$, $d(g, e) = d(g^{-1}, e)$. For the triangle inequality,
given any $x$, $y$, and $z \in E$, if $gx = y$ and $hy = z$ with $\rho_d(x, y) = d(g, e)$ and
$\rho_d(y, z) = d(h, e)$, then $hgx = z$, so

\[ \rho_d(x, z) \le d(hg, e) \le d(hg, g) + d(g, e) = d(h, e) + d(g, e)
   = \rho_d(x, y) + \rho_d(y, z). \]

So $\rho_d$ is a metric on $E$. It clearly satisfies $\rho_d(x, y) = \rho_d(gx, gy)$ for any
$x, y \in E$ and $g \in G$, i.e., $\rho_d$ is invariant under the action of $G$.

Thus all the hypotheses of Corollary 2.53 hold whenever $T$ is a homogeneous
space under the action of a compact metrizable group $G$. There are many
examples of such $T$ and $G$. One class of them is as follows. In $\mathbb{R}^d$ for any $d \ge 2$,
let $S^{d-1}$ be the unit sphere $\{x \in \mathbb{R}^d : |x| = 1\}$ and let $G$ be the group $O(d)$
of all orthogonal transformations $U$ of $\mathbb{R}^d$ onto itself, in other words linear
transformations $U$ such that $Ux \cdot Uy = x \cdot y$ for the usual inner product. One
can also take $SO(d)$, the special orthogonal group of orthogonal transformations
(given by matrices) with determinant 1. For $d = 2$, we get $T = S^1$, the
unit circle $x^2 + y^2 = 1$ in $\mathbb{R}^2$, and $G = SO(2)$ is the group of rotations. The
unique invariant probability measure is $dm(\theta) = d\theta/(2\pi)$. Here $d$ may be any
rotationally invariant metric (or pseudometric) on the circle. (Unlike $m$, $d$ is
not at all unique; for example, if $\rho$ is a $G$-invariant metric, so is $\rho^{\alpha}$ for any $\alpha$
with $0 < \alpha < 1$.) (See Problems 11 and 12.)
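The remark on non-uniqueness of invariant metrics can be illustrated on the circle. The sketch below takes the arc-length metric as one rotation-invariant choice (an assumption made for illustration; the text allows any invariant metric), raises it to the power $\alpha = 1/2$, and checks the triangle inequality and rotation invariance on a finite grid of angles. A finite check of this kind is only a sanity test, not a proof.

```python
import itertools
import math

def arc(theta1, theta2):
    """Arc-length distance between two angles on the unit circle."""
    diff = abs(theta1 - theta2) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)

def powered(alpha):
    """The function arc(.,.)**alpha, a candidate invariant metric for 0 < alpha < 1."""
    return lambda a, b: arc(a, b) ** alpha

angles = [2 * math.pi * k / 12 for k in range(12)]
rho = powered(0.5)

# Triangle inequality on all triples of grid points (small tolerance for rounding).
triangle_ok = all(rho(x, z) <= rho(x, y) + rho(y, z) + 1e-12
                  for x, y, z in itertools.product(angles, repeat=3))

# Invariance under a rotation by an arbitrary angle.
shift = 0.7
invariant_ok = all(abs(rho(x + shift, y + shift) - rho(x, y)) < 1e-9
                   for x, y in itertools.product(angles, repeat=2))
```

The triangle inequality survives because $s \mapsto s^{\alpha}$ is concave and vanishes at 0, hence subadditive.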


14:22 


P1: KNP 


CUUS2019-02 CUUS2019-Dudley 978 0 521 49884 5 September 24, 2013 


Trim: 6in x 9in Top: 0.5in Gutter: 0.664in 




To see how Theorem 2.52 applies beyond Corollary 2.53, suppose one wants 
to prove sample-continuity of a Gaussian process on a locally compact but not 
compact metric space, such as a Euclidean space or a noncompact manifold. 
Then it suffices to prove sample continuity on each of a family of compact sets 
whose interiors form a base for the topology, such as balls or cubes in Euclidean 
spaces. Then one can often define a measure, such as Lebesgue measure in a 
Euclidean space, restrict it to a compact set $C$, and normalize it to have mass 1
to get a law $\mu$. Here $\mu(B(x, r))$ may not depend on $x$ while $B(x, r)$ is included
in the interior of $C$, but become smaller as $x$ approaches the boundary of $C$,
yet the hypothesis of Theorem 2.52 still holds. See for example Problem 13.


2.12 Sample Continuity and Compactness 


This section will show that for a Gaussian process $X_t$ indexed by a compact
metric space, or other suitable parameter space such as an open or closed set in a
Euclidean space, sample continuity reduces to that of the isonormal process on
some subsets, and continuity of the nonrandom function $t \mapsto EX_t$ (Corollary
2.56).

Let $(T, \mathcal{T})$ and $(W, \mathcal{U})$ be two topological spaces. Let $\{X_t, t \in T\}$ be a
stochastic process defined over a probability space $(\Omega, \mathcal{B}, P)$ with values in
$W$, meaning that for each $t \in T$ and Borel set $B \subset W$, $X_t^{-1}(B) \in \mathcal{B}$. (Recall
that the $\sigma$-algebra of Borel sets is generated by the open sets and that it is
equivalent to assume $X_t^{-1}(U) \in \mathcal{B}$ for each $U \in \mathcal{U}$.) Let $\{Y_t, t \in T\}$ be another
process with values in $W$, possibly defined over a different probability space
$(\Omega', \mathcal{B}', P')$. Recall that the processes $\{X_t\}$ and $\{Y_t\}$ have the same laws iff
for every $n = 1, 2, \ldots$, and $t_1, \ldots, t_n \in T$, the law of $\{X_{t_j}\}_{j=1}^{n}$ on the product
$\sigma$-algebra in $W^n$ is the same as that of $\{Y_{t_j}\}_{j=1}^{n}$. Two processes with the same
laws are said to be versions of each other. A process $\{X_t\}_{t \in T}$ will be called
version-continuous iff there is a process $Y$ with the same laws such that for all
$\omega' \in \Omega'$, $t \mapsto Y_t(\omega')$ is continuous from $T$ into $W$. (Equivalently, continuity
need only hold for almost all $\omega'$ since then without changing the laws, for the
set of measure 0 of values of $\omega'$ for which $t \mapsto Y_t(\omega')$ is not continuous, one can
replace it by a fixed continuous function, say having a constant value in $W$.)

Now let $W = \mathbb{R}$ with usual topology and suppose $\{X_t, t \in T\}$ is a Gaussian
process. Suppose also that $(T, e)$ is a metric space with the metric topology on
$T$. We have


Theorem 2.54 A Gaussian process $\{X_t\}$ indexed by a metric space $T$, defined
on a probability space $(\Omega, P)$, is version-continuous if and only if both

(a) the nonrandom function $t \mapsto EX_t$ is continuous, and

(b) the process $\{X_t - EX_t\}$ is version-continuous.

Then, $t \mapsto X_t(\cdot)$ is continuous into $L^2(P)$.


Proof. "If" is clear. To prove "only if," suppose $X_t$ is sample-continuous. For
any sequence $t_n \to t \in T$, sample continuity implies that $X_{t_n} \to X_t$ almost
surely, and therefore in probability. For jointly Gaussian random variables,
convergence in probability is equivalent to convergence in $L^2(P)$, since for the
Gaussian variables $Y_n := X_{t_n} - X_t$ to converge to 0 in probability, the means
$EY_n$ must converge to 0, and so must the variances. Thus $EX_{t_n} \to EX_t$. Since
$T$ is a metric space, (a) follows. Then by subtracting the continuous function
$EX_t$, (b) follows.


So in studying sample continuity or version-continuity of Gaussian processes
we may as well restrict ourselves to processes with mean 0. Let $X_t$ be such a
process, $t \in T$. Each $X_t(\cdot)$ is an element of a Hilbert space $H$, namely $L^2(P)$.
Consider the isonormal process $L$ on this $H$. Then since $L$ is Gaussian, has
mean 0, and preserves covariances, we see that $L(X_t)$ has the same laws as $X_t$.

If $h(\cdot)$ is a continuous function from $T$ into a Hilbert space $H$, with range
$C := \{h(t) : t \in T\}$, and if $L$ restricted to $C$ is version-continuous, then the
process $L \circ h$ is clearly version-continuous. Conversely, if $(T, e)$ is compact
and $h$ is 1-1, then $h$ is a homeomorphism (RAP, Theorem 2.2.11). Then,
version continuity of $L$ on $C$ and $L \circ h$ on $T$ are equivalent. So, for $(T, e)$
compact and $t \mapsto X_t(\cdot)$ one-to-one, version continuity of the Gaussian process
$X_t$ reduces to that of $L$ on a subset $C$ of $H$. (Theorem 2.55 and Corollary
2.56 below will show that the 1-1 assumption is not actually necessary.) If $T$
is locally compact, for example an open or closed subset of some $\mathbb{R}^k$, then
continuity is equivalent to continuity on each compact subset.

Let $T$ be a set and $d$ a pseudo-metric on $T$: for all $x, y, z \in T$, $d(x, y) =
d(y, x)$ and $d(x, z) \le d(x, y) + d(y, z)$, $d(x, x) = 0$, but possibly $d(x, y) = 0$
for some $x \ne y$. Recall that for a set $S \subset T$, the diameter (with respect to $d$) is
defined by $\operatorname{diam} S := \operatorname{diam}_d S := \sup\{d(x, y) : x, y \in S\}$.

The next fact holds for general, not necessarily Gaussian processes. 


Theorem 2.55 If $(T, e)$ is a compact metric space, $h$ is a continuous function
from $T$ onto a metric space $K$ and $Y(x, \omega)$, $x \in K$, $\omega \in \Omega$, is a stochastic
process on $K$ with values in a complete separable metric space $S$, then $Y \circ h$
is version-continuous on $T$ if and only if $Y$ is on $K$.


Remark If $Y(x, \omega) \equiv Y(x)$, a nonrandom function, then the result is a known
fact in general topology (RAP, Theorem 2.2.11). The difficulty in the proof
here is that if $Y \circ h$ is version-continuous, it is not clear that the corresponding
sample-continuous process $X$ can be written as $Y' \circ h$ for a process $Y'$ on $K$.


Proof. "If" is obvious. Conversely let $Y \circ h$ be version-continuous and take
a process $X$ on $T$ with the same laws as $Y \circ h$ and $t \mapsto X(t, \omega) := X_t(\omega)$
continuous on $T$ for all $\omega$. Let $\rho$ and $\zeta$ be the metrics on $K$ and $S$ respectively.
Let $d(s, t) := \rho(h(s), h(t))$, a pseudo-metric on $T$. Let $A$ be a countable dense
subset of $T$, and $B := h(A) := \{h(a) : a \in A\}$, so $B$ is a countable dense subset
of $K$. For $\gamma, \delta > 0$ and any countable set $F \subset T$ define a random variable by

\[ D(F, \delta, \gamma) := \sup\{\zeta(X_s, X_t) : s, t \in F,\ d(s, t) < \delta,\ e(s, t) < \gamma\}. \]

This is measurable since $F$ is countable and $S$ is separable (RAP, Proposition
4.1.7). Let $D(F, \delta) := D(F, \delta, 1 + \operatorname{diam}_e T)$, and

\[ UC := \bigcap_{n=1}^{\infty} \bigcup_{m=1}^{\infty} \{D(A, 1/m) < 1/n\}, \]

so that $UC$ is measurable. If $P(UC) = 1$, then the sample functions $t \mapsto X_t(\omega)$
are almost surely uniformly continuous with respect to $d$ on $A$. Since $A$ is
countable and $Y \circ h$ has the same laws as $X$, $Y \circ h$ also has sample functions
almost surely uniformly continuous for $d$ on $A$. Equivalently, $Y$ has sample
functions almost surely uniformly continuous for $\rho$ on $B$. For any $x \in K$ let
$Y'(x) := \lim\{Y(u) : u \to x,\ u \in B\}$. Almost surely, all these limits exist (by
uniform continuity and since $S$ is complete), and $x \mapsto Y'(x)$ is continuous.
Since $X$ and $Y' \circ h$ both have continuous sample functions on $T$ and have the
same law on $A$, they have the same laws on $T$, and so does $Y \circ h$ by choice of
$X$. It follows that $Y'$ has the same laws as $Y$ on $K$, so $Y$ is version-continuous
as desired.

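Otherwise, $P(UC) < 1$. Then for some $\varepsilon > 0$,

\[ \inf_{\delta > 0} P(D(A, \delta) > 3\varepsilon) > 3\varepsilon. \]

Then by inclusion (monotone convergence, with $\delta = 1/m$),

\[ P\Big( \bigcap_{\delta > 0} \{D(A, \delta) > 3\varepsilon\} \Big) \ge 3\varepsilon. \tag{2.71} \]

On the other hand, continuity of $t \mapsto X_t$ and compactness of $T$ imply that for
some $\gamma > 0$, and any countable set $C \subset T$,

\[ P\{D(C, 1 + \operatorname{diam}_d T, \gamma) > \varepsilon\} < \varepsilon. \tag{2.72} \]

(Otherwise, take a countable union of countable sets for $\gamma = 1/n$, $n =
1, 2, \ldots$, to get a contradiction.)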

$T$ is a finite union of $e$-open sets $T_i$ with $\operatorname{diam}_e T_i < \gamma$. For each $i \ne j$,
if there are $s \in T_i$ and $t \in T_j$ with $h(s) = h(t)$, then we say $(i, j) \in \mathcal{L}$, and
let us choose and fix such $s = s(i, j)$ and $t = t(i, j)$. Let $C$ be the union of
$A$ and the set of all $s(i, j)$ and $t(i, j)$. Since $\mathcal{L}$ is finite, we can assume that
$X_{s(i,j)}(\omega) = X_{t(i,j)}(\omega)$ for all $\omega$ and $(i, j) \in \mathcal{L}$. Let

\[ J := \bigcap_{n=1}^{\infty} \{D(A, 1/n) > 3\varepsilon\} \cap \{D(C, 1 + \operatorname{diam}_d T, \gamma) \le \varepsilon\}. \]

Then (2.71) and (2.72) give $P(J) \ge 3\varepsilon - \varepsilon > \varepsilon$. Fix an $\omega \in J$ and choose
$s_n \in C$ and $t_n \in C$ such that $d(s_n, t_n) < 1/n$ and $\zeta(X_{s_n}, X_{t_n})(\omega) > 3\varepsilon$. By
compactness, we can assume that the sequences $s_n$ and $t_n$ both converge for $e$ and
hence also for $d$. Let $s_n \to s$ and $t_n \to t$. Then $d(s, t) = 0$, $X_{s_n}(\omega) \to X_s(\omega)$
and $X_{t_n}(\omega) \to X_t(\omega)$ as $n \to \infty$. Let $s \in T_i$ and $t \in T_j$. If $i = j$ we have, since
$\omega \in J$, $3\varepsilon \le \zeta(X_s, X_t)(\omega) \le \varepsilon$, a contradiction. If $i \ne j$, then $(i, j) \in \mathcal{L}$.
For $n$ large enough, $s_n \in T_i$ and $t_n \in T_j$, so

\[ 3\varepsilon < \zeta(X_{s_n}, X_{t_n})(\omega) \le \zeta(X_{s_n}, X_{s(i,j)})(\omega) + \zeta(X_{t(i,j)}, X_{t_n})(\omega)
   \le \varepsilon + \varepsilon = 2\varepsilon < 3\varepsilon, \]

again a contradiction.


Now recall that a totally bounded set C in a Hilbert space H is called a 
GC-set iff L restricted to C has a version with uniformly continuous sample 
functions. 


Corollary 2.56 A Gaussian process $\{X_t, t \in T\}$ with mean 0 on a compact
metric space $(T, e)$ is version-continuous if and only if both $t \mapsto X_t(\cdot) \in H :=
L^2(P)$ is continuous and its range $K$ is a GC-set.


Proof. Apply Theorem 2.55 with $h(t) := X_t(\cdot)$, $K := h(T)$, $\rho$ the usual
metric in $H$, and $S = \mathbb{R}$ with its usual metric; again $L(X_t(\cdot))$ has the same
laws as $X_t$. If $X_t$ is version-continuous, then $t \mapsto X_t(\cdot)$ is continuous into $H$
by Theorem 2.54, and the rest follows.


Example. If $X_t$ is a Gaussian process defined for $t \in \mathbb{R}$, suppose $X_t$ is periodic
of period $2\pi$, $X_t = X_{t+2\pi}$ for all $t$. Suppose that $E((X_s - X_t)^2) > 0$ for $0 < |s -
t| < 2\pi$. Then we can write the process as $X_t = Y(e^{it})$ where $Y$ is a process
indexed by the unit circle $T^1 := \{z : |z| = 1\}$ in the complex plane, which is
compact. Version continuity for $X$ and $Y$ are equivalent, and $z \mapsto Y(z)(\cdot)$ is
1-1 from $T^1$ into $H := L^2(P)$, so version continuity is equivalent to that of $L$
on the range of $Y$ in $H$ (without needing Theorem 2.55 and Corollary 2.56).
On the other hand, any process indexed by $\mathbb{R}$ is version continuous if and only
if it is so on each compact interval $[-N, N]$, where in this example for $N > \pi$,
the process is not 1-1 into $H$.


Recall that a sample function of a stochastic process $X_t$ is a function $t \mapsto
X_t(\omega)$ for a fixed $\omega$. The usual metric on Hilbert space is the natural one for an
isonormal process, but the GC-property holds for other metrics in the following
sense:




Theorem 2.57 Let $C$ be a subset of Hilbert space $H$. Then $C$ is a GC-set if
and only if there exists a metric $\rho$ on $C$ such that $(C, \rho)$ is totally bounded,
and the sample functions of the isonormal process $L$ on $C$ can be chosen to be
$\rho$-uniformly continuous a.s.


Proof. “Only if” is immediate where $\rho$ is the usual metric. To prove “if,” take a
version of $L$ such that on a set of probability one, the sample functions of $L$ are
$\rho$-uniformly continuous on $C$. Then $L$ extends to a Gaussian process $t \mapsto X_t$
on the compact completion $M$ of $C$ for $\rho$. Here $X_t$ is version-continuous, and
so by Corollary 2.56, $C$ is included in, and thus is, a GC-set.


2.13 Two-Series and One-Series Theorems 


The following material was not needed so far in the text, but it can be helpful 
in some of the problems. 

For independent real random variables $X_n$, Lévy’s equivalence theorem says
that three ways for the series $\sum_{n=1}^{\infty} X_n$ to converge are equivalent: almost surely,
in probability, or in distribution (e.g. RAP, Theorem 9.7.1). Let the variables,
truncated to have absolute values $\le 1$, be $X_n^1 := X_n$ if $|X_n| \le 1$ and 0 otherwise.
The three-series theorem (e.g. RAP, Theorem 9.7.3) says that the almost sure
convergence of $\sum_n X_n$ is equivalent to convergence of all three of three series of
numbers: $\sum_n P(|X_n| > 1)$, $\sum_n E X_n^1$ (which need not converge absolutely),
and $\sum_n \operatorname{Var}(X_n^1)$.

If the variables satisfy further conditions, the conditions can simplify.
Specifically, we have the following “two-series” theorem.


Theorem 2.58 Let $X_n$ be independent, nonnegative real random variables.
Then for $\sum_{n=1}^{\infty} X_n$ to converge almost surely (to a finite limit random variable),
it suffices that $\sum_{n=1}^{\infty} E X_n < +\infty$. It is equivalent that the following two series
should both converge:

(a) $\sum_n P(X_n > 1)$,

(b) $\sum_n E X_n^1$, where $X_n^1 := X_n 1_{\{X_n \le 1\}}$ is $X_n$ truncated at 1.


Proof. The sequence of partial sums $S_n = \sum_{j=1}^{n} X_j$ is nondecreasing up to
some limit $S_\infty \le +\infty$. If $\sum_n E X_n < +\infty$ then by Fatou’s lemma or the
monotone convergence theorem, $E S_\infty < +\infty$, so $S_\infty$ is finite almost surely
and the series converges almost surely.

Now suppose both series (a) and (b) converge. Then by series (a) and the
Borel–Cantelli lemma, almost surely $X_n = X_n^1$ for all $n \ge n_0(\omega)$ large enough,
where $X_n^1$ is $X_n$ truncated at 1. We have $\sum_n X_n^1$ converging almost surely by
series (b) and the first part of the proof, and so $\sum_{n=1}^{\infty} X_n$ converges almost surely.

Conversely, if $\sum_{n=1}^{\infty} X_n$ converges almost surely, then both series (a) and
(b) must converge by the three-series theorem. This completes the proof.


So, for nonnegative variables one need not consider the variances of the
truncated variables $X_n^1$. It is actually easy to see why this is, namely, since
$0 \le X_n^1 \le 1$,

$$\operatorname{Var}(X_n^1) \le E((X_n^1)^2) \le E(X_n^1)$$

for all $n$, so convergence of series (b), for nonnegative variables, implies that
of series (c) in the three-series theorem.

For series of independent normal variables $X_n$ with $E X_n = 0$, convergence
will reduce to that of one series of numbers, the variances. There are also other
aspects of convergence that can be included. A series is said to converge
unconditionally if it can be rearranged in any order and converges to the same limit.
For a given series $\sum_n a_n$ of real numbers, it is known that unconditional
convergence is equivalent to absolute convergence, i.e. $\sum_n |a_n| < \infty$. If $(S, \|\cdot\|)$
is a Banach space, then a series $\sum_n s_n$ of elements of $S$ is said to converge
absolutely iff $\sum_n \|s_n\| < \infty$. It is known that in infinite-dimensional Banach
spaces, unconditional convergence of series is not equivalent to absolute
convergence, and we will see that in a Hilbert space $H$. Here is an equivalence,
where the one series of real numbers is the one in part (e):


Theorem 2.59 Let $G_i$ be independent $N(0, \sigma_i^2)$ random variables defined on
a probability space $(\Omega, P)$. Then the following are equivalent:

(a) $\sum_{i=1}^{\infty} G_i$ converges almost surely;

(b) $\sum_{i=1}^{\infty} G_i^2 < \infty$ almost surely;

(c) $\sum_{i=1}^{\infty} G_i$ converges in the Hilbert space $H = L^2(\Omega, P)$;

(d) The series in (c) converges unconditionally in $H$;

(e) $\sum_{i=1}^{\infty} \sigma_i^2 < \infty$.


Proof. By the Lévy equivalence theorem, (a) is equivalent to convergence in
probability. For jointly Gaussian random variables, convergence in probability
is equivalent to convergence in $L^2$, as was noted in the proof of Theorem 2.33
(for a Gaussian variable to be close to 0 in probability, its mean, in this case 0,
and its variance must be small). Thus (a) is equivalent to (c). A series $\sum_i s_i$
of orthogonal elements of $H$ converges in $H$ if and only if $\sum_i \|s_i\|^2 < \infty$.
Thus (e) is equivalent to (c) as well as (a). By the two-series Theorem 2.58,
(e) implies (b). Conversely, if (b) holds, it follows from the Borel–Cantelli
Lemma that $\sum_i P(|G_i| > 1) < \infty$, and so $\sigma_i^2 \to 0$ as $i \to \infty$. By the
two-series theorem, $\sum_i E(G_i^2 1_{|G_i| \le 1}) < \infty$, and the terms are asymptotic to $\sigma_i^2$, so
(e) holds and (b) is equivalent to (e).

Convergence of the series in (e), if it holds, is clearly absolute and
unconditional, thus its convergence is equivalent to (d). So the theorem is
proved.


Example. Suppose $\sigma_i^2 = 1/i^2$ for all $i$. Then all the equivalent conditions in
Theorem 2.59 hold. We have $\|G_i\| = 1/i$ in $H$, so the series $\sum_i G_i$ does
not converge absolutely in $H$, giving an example where unconditional
convergence is not equivalent to absolute convergence in $H$, an infinite-dimensional
Hilbert space. Moreover, the series in (a) does not converge absolutely: with
probability 1, by the two-series theorem, $\sum_{i=1}^{\infty} |G_i| = +\infty$, and so for almost
all $\omega$, $\sum_i G_i(\omega)$ does not converge unconditionally.
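
The three series that appear in this example can be tabulated numerically. The following is only an illustrative sketch, not part of the text's argument: the function name `partial_sums` is hypothetical, and the closed form used for the truncated mean, E(|G| 1{|G| <= 1}) = sigma sqrt(2/pi) (1 - exp(-1/(2 sigma^2))) for G with law N(0, sigma^2), is a direct computation assumed here rather than quoted from the text.

```python
import math

def partial_sums(n_terms):
    """Partial sums for the example sigma_i = 1/i, i = 1, ..., n_terms.

    Returns (var_sum, norm_sum, trunc_sum) where
      var_sum   = sum of sigma_i^2            (series (e) of Theorem 2.59),
      norm_sum  = sum of ||G_i|| = sigma_i    (absolute convergence in H),
      trunc_sum = sum of E(|G_i| 1{|G_i|<=1}) (series (b) of the two-series
                  theorem applied to the nonnegative variables |G_i|).
    """
    var_sum = 0.0
    norm_sum = 0.0
    trunc_sum = 0.0
    for i in range(1, n_terms + 1):
        sigma = 1.0 / i
        var_sum += sigma * sigma
        norm_sum += sigma
        # Closed form for the mean of |G| truncated at 1 (assumed, see lead-in).
        trunc_sum += sigma * math.sqrt(2.0 / math.pi) * (
            1.0 - math.exp(-1.0 / (2.0 * sigma * sigma))
        )
    return var_sum, norm_sum, trunc_sum

if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, partial_sums(n))
```

The first partial sum stays bounded (it increases to pi^2/6), while the other two grow like a multiple of log n, which is the numerical face of "unconditional but not absolute" convergence.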


Problems 
1. If $X$ and $Y$ are i.i.d. $N(0, 1)$, evaluate $E \max(X, Y)$.
2. Evaluate $E \exp(a\|X\|^2)$ (finite or infinite) as a function of $a > 0$ if


(a) $\mathcal{L}(X) = N(0, 1)$ in $\mathbb{R}$, $\|X\| = |X|$;
(b) $\mathcal{L}(X) = N(0, C)$ in $\mathbb{R}^2$, C = @ 9), and $\|(x_1, x_2)\| = (x_1^2 + x_2^2)^{1/2}$.


3. Let $H$ be the Hilbert space $L^2([0, +\infty), \lambda)$ where $\lambda$ is Lebesgue measure.
As usual let $1_A$ be the indicator function of a set $A$, i.e. $1_A(x) = 1$ for $x \in A$
and 0 otherwise.


(a) Show that for the isonormal process $L$ on $H$, $x_t = L(1_{[0,t]})$ for $t \ge 0$ gives
a Brownian motion (is a Gaussian process with correct mean and covariance).


(b) For $0 \le t \le 1$ find functions $g_t$ such that $y_t = L(g_t)$ gives a Brownian
bridge.


4. Let $H$ be a Hilbert space with orthonormal basis $\{e_j\}_{j \ge 1}$. Let $G_n$ be
independent with laws $N(0, \sigma_n^2)$.


(a) Under what conditions on $\sigma_n$ does $\sum_n G_n e_n$ converge almost surely in the
norm of $H$? Hint: Apply the two-series Theorem 2.58 to suitable real-valued
random variables.

(b) If $G = \sum_n G_n e_n$ in $H$ as in (a), where the sum converges almost surely,
find for what $a > 0$ we have $E \exp(a\|G\|^2) < \infty$. After doing this directly,
compare with the results of Section 2.3.


5. Let $G_n$ be i.i.d. $N(0, 1)$ variables and $a_n > 0$ for each $n$. Under what
conditions on $a_n$ is $\sum_n a_n |G_n| < \infty$ a.s.? Hint: Use the two-series theorem.


6. (a) Show that for any set $A$ in a real vector space $V$, and any vector space $W$ of
linear forms on $V$, the polar $A^{*1}$ of $A$, defined by $A^{*1} := \{w \in W : w(v) \le 1$
for all $v \in A\}$, is convex in $W$.

(b) Let $C \subset V$ be a set and $D$ its convex hull, the smallest convex set including
$C$. Show that $D^{*1} = C^{*1}$.


(c) In $V = \mathbb{R}^2$ let $C$ be the unit square $\{0 \le x \le 1,\ 0 \le y \le 1\}$. Evaluate the
polar $C^{*1}$, where $W = \mathbb{R}^2$ and $w(v) := w_1 v_1 + w_2 v_2$.


7. Let $H$ be a Hilbert space with orthonormal basis $\{e_n\}_{n \ge 1}$. For $c_n > 0$ let
$E(\{c_n\}_{n \ge 1}) := \{\sum_{n \ge 1} x_n e_n : \sum_{n \ge 1} x_n^2/c_n^2 \le 1\}$, an infinite-dimensional
ellipsoid. Show that $E$ is a GB-set if and only if $\sum_n c_n^2 < \infty$.


8. With notation as in the previous problem, let $C := \{e_n/(\log n)^{1/2} : n \ge 2\}$.
Show that $C$ is not a GC-set (it is a GB-set as shown in the example before
Theorem 2.16).


9. Let $\psi$ be a characteristic function on $\mathbb{R}$, so that $\psi(t) = \int_{-\infty}^{\infty} e^{itx}\,dP(x)$ for
some law $P$ on $\mathbb{R}$, which we assume is symmetric, $P(A) = P(-A)$ for all
Borel sets $A$. (Then $\psi$ must be real-valued.) Show that there exists a Gaussian
process $X_t$, $t \in \mathbb{R}$, with mean 0 and covariance $E X_s X_t = \psi(s - t)$ for all real
$s, t$.


10. (a) For each $t > 0$ let $\psi_t$ be a “triangle function” on $\mathbb{R}$ with $\psi_t(0) = 1$ and
$\psi_t(s) = 0$ whenever $|s| \ge t$, while $\psi_t$ is linear on each interval
$[-t, 0]$ and $[0, t]$. Show that $\psi_t$ satisfies the conditions of the previous problem.
Hint: Find its (inverse) Fourier transform and show that it is a probability
density.

(b) Let $\psi$ be a continuous real-valued function on $\mathbb{R}$ which is even, $\psi(-x) = \psi(x)$,
$\psi(0) = 1$, and on $[0, \infty)$, $\psi$ is nonincreasing, nonnegative, and convex.
Show that $\psi$ is a characteristic function. Hint: Use a mixture of triangle
functions. First consider the case that $\psi$ is piecewise linear and is 0 outside some
finite interval. Take limits of such piecewise linear functions.


11. For $a > 0$ let $\psi(x) = 1 - (\log(1/|x|))^{-a}$ for $x$ in some neighborhood of 0
(piecewise linear elsewhere).


(a) Show that there exists such a $\psi$ satisfying the conditions of Problem 9,
assuming Problem 10.


(b) What can be said about sample-continuity of the Gaussian process $X_t$ for
different values of $a$?


12. Let $X_t(\omega) := \sum_{n=-\infty}^{\infty} G_n(\omega) \cos(nt)$ where $G_n$ are independent random
variables with laws $N(0, \sigma_n^2)$ and $\sum_n \sigma_n^2 < \infty$. Show that the process $t \mapsto X_t$
is version-continuous if and only if $\{X_t(\cdot) : 0 \le t \le 2\pi\}$ is a GC-set in $L^2(\Omega)$.

13. Let $X_t(\omega) := \sum_n G_n(\omega) \cos(nt) + H_n \sin(nt)$ where $G_n$ and $H_n$ for all
$n$ are independent random variables (also independent for different $n$) where
for each $n$, both $G_n$ and $H_n$ have laws $N(0, \sigma_n^2)$ and $\sum_n \sigma_n^2 < \infty$. Show that the
process $t \mapsto X_t$ is sample-continuous if and only if $T := \{X_t(\cdot) : 0 \le t \le 2\pi\}$
is a GC-set in $L^2(\Omega)$, and if so, $T$ satisfies the hypothesis of Theorem 2.36.
Hint: Show that Corollary 2.53 applies, in that for points $p_t := (\cos t, \sin t)$ of
the unit circle, $d(s, t) := [E(X_s - X_t)^2]^{1/2}$ is rotationally invariant, although it
is in general not a usual metric such as distance in $\mathbb{R}^2$ or arc length distance.


14. Let $\{X_t\}_{t \in \mathbb{R}}$ be a stationary Gaussian process with mean 0, where stationarity
means that for any $n = 1, 2, \ldots$, any $t_1, \ldots, t_n \in \mathbb{R}$, and any $h \in \mathbb{R}$, the joint
distribution of $\{X_{t_j}\}_{j=1}^{n}$ is the same as that of $\{X_{t_j + h}\}_{j=1}^{n}$. Suppose that
$t \mapsto X_t(\cdot)$ is continuous in probability, or equivalently into $L^2(\Omega)$. Show that:


(a) The process is version-continuous if and only if it is when restricted to the
interval $[-1, 1]$.

(b) Show further that for the set $C := \{X_t(\cdot)\}_{-1 \le t \le 1}$ in $H = L^2(\Omega, P)$,
version-continuity holds if and only if $C$ satisfies the hypothesis of Theorem 2.36. Hint:
Show that the hypothesis of Theorem 2.52 holds with $M = 2$ and $\mu$ equal to
Lebesgue measure over 2, even though the (pseudo)metric on $[-1, 1]$ induced
by the process is not in general the usual one.


15. Let $e_n$ be orthonormal in $H$ and $C := \{a_n (\log n)^{-1/2} e_n\}_{n \ge 2}$ where $a_n \to 0$
as $n \to \infty$.


(a) Show that every such set is a GC-set.


(b) By taking $a_n \to 0$ slowly enough, show that for any $r > 0$ there exist
GC-sets $C$ with $D(\varepsilon, C) \ge \exp[1/(\varepsilon^2 |\log \varepsilon|^r)]$ for $\varepsilon > 0$ small enough.


16. (A further extension of Problem 10.) For $c > 0$ let $\psi(x) := 1 -
(\log(1/|x|))^{-1}(\log \log(1/|x|))^{-c}$ for $x$ in some neighborhood of 0 (piecewise
linear elsewhere).


(a) Show that there exists such a $\psi$ satisfying the conditions of Problem 9,
assuming Problem 10.


(b) What can be said about sample-continuity of the Gaussian process $X_t$ for
different values of $c$?


(c) Show that for any $r < 2$ there exist non-GB-sets $C$ such that for $\varepsilon > 0$
small enough, $D(\varepsilon, C) \le \exp(1/(\varepsilon^2 |\log \varepsilon|^r))$. Compare with Problem 15,
part (b), to see that for $0 < r < 2$, one cannot tell from $D(\varepsilon, C)$ whether $C$ is a
GC-set or not.


17. Let $v_k$ be the Lebesgue volume of the unit ball in $\mathbb{R}^k$. Then it is known
that $v_k = \pi^{k/2}/\Gamma(1 + (k/2))$ for $k = 1, 2, \ldots$. For any $c_i > 0$, $i = 1, \ldots, k$, the ellipsoid
$\mathcal{E}_k := E(\{c_i\}_{i=1}^{k}) := \{x : \sum_{i=1}^{k} x_i^2/c_i^2 \le 1\} \subset \mathbb{R}^k$ has volume
$v_k c_1 \cdots c_k$. For $\varepsilon > 0$ let $m := D(\varepsilon, \mathcal{E}_k)$.


(a) Show that $m \ge c_1 c_2 \cdots c_k\, \varepsilon^{-k}$.

(b) If $c_j \ge \varepsilon$ for $j = 1, \ldots, k$, show that $m(\varepsilon/2)^k \le 2^k c_1 \cdots c_k$.

(c) If $c_j = j^{-a}$ for $j = 1, 2, \ldots$, for the infinite-dimensional ellipsoid $\mathcal{E}$ equal
to $E(\{c_j\}_{j \ge 1})$ as in Problem 7, give upper and lower bounds for $D(\varepsilon, \mathcal{E})$ as $\varepsilon \downarrow 0$.
Hint: Choose $k$, depending on $\varepsilon$, to give as good bounds as possible. Recall
Stirling’s formula $k!/[(k/e)^k (2\pi k)^{1/2}] \to 1$ as $k \to \infty$ (Theorem 1.17). Are
your bounds consistent with Theorems 2.14 and 2.36 and the result of
Problem 7?


18. If $g$ is a Young–Orlicz modulus, $x^2 = o(\log g(x))$ as $x \to +\infty$, and $Y$ is a
$N(0, 1)$ random variable, show that $\|Y\|_g = +\infty$.


19. If $g$ is a Young–Orlicz modulus, $\log g(x) = O(x^2)$ as $x \to +\infty$, and $Y$ is
a $N(0, 1)$ random variable, show that $\|Y\|_g < \infty$.


20. Let $f(x) = \pi - x$ for $0 < x < \pi$ and $f(x) = -\pi - x$ for $-\pi < x < 0$.
Find the Fourier series

$$f(x) = \sum_{n=1}^{\infty} a_n \sin(nx) + b_n \cos(nx)$$

in $L^2((-\pi, \pi))$ and use it to prove $\sum_{n=1}^{\infty} n^{-2} = \pi^2/6$.


21. (a) Find a numerical value of the absolute constant $C$ in Proposition 2.46
used in bounding the difference of the mean $EF$ and median $m(F)$ of a Lipschitz
function $F$ with respect to $N(0, I)$, $|EF - m(F)| \le C\|F\|_L$. Use the method
of proof for that Proposition and the bound $\alpha_\gamma(r) \le 2 \exp(-r^2/4)$ given in
Theorem 2.43. Also use that for any $r > 0$, $P(F - m(F) > r) \le 1/2$ by definition
of median, since $P(F \le m(F)) \ge 1/2$.

(b) Find a value of $C$ by another method using the inequality not proved in
this text (but in Ledoux’s 2001 book), $P(F - EF \ge r\|F\|_L) \le \exp(-r^2/2)$,
so if the right side is $< 1/2$, then $m(F) - EF \le r\|F\|_L$, and considering $-F$,
$|m(F) - EF| \le r\|F\|_L$.


Notes 


Notes to Section 2.3. Proposition 2.5 improves on RAP, Lemma 12.1.6(b) 
by a factor of 2. Lemma 2.10 was first proved by Landau and Shepp (1971), 
then by Fernique (1970) whose note giving a much shorter proof appeared in 
print earlier. Fernique’s statement about an error in Landau and Shepp’s paper 
apparently had to do with an earlier, unpublished draft of the paper. The main 
theorems 2.6 and 2.11, giving the best possible upper bound for $\alpha$, were then
proved independently by Marcus and Shepp (1972) and Fernique (1971). The
current exposition is based mainly, though not entirely, on Fernique (1975).

The Darmois-Skitovič theorem says that if $X_1, \ldots, X_n$ are independent real
random variables where for some constants $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$,
$a_1 X_1 + \cdots + a_n X_n$ is independent of $b_1 X_1 + \cdots + b_n X_n$, with $a_i b_i \ne 0$, then for that
$i$, $X_i$ has a normal distribution: see Darmois (1951) and Skitovič (1954).
C. R. Rao (1973, pp. 158–163 and 218) lists various characterizations of normal
distributions.


Notes to Section 2.4. Slepian (1962, Lemma 1) proved his inequality. Sudakov 
(1969) announced a result somewhat weaker than Theorem 2.14. Sudakov 
(1971, Theorem 5) announced a stronger result, corrected in Sudakov (1973) 
to be what is here Theorem 2.14. Lemma 2.15 is essentially from Chevet 
(1970). Fernique (1975, Théorème 2.1.2, Corollaire 2.1.3) proved Theorem 
2.16. The proof given is as in Fernique (1997, pp. 59, 63—67), who mentions an 
idea of Kahane (cf. Kahane 1986). Fernique actually proved the more general 
(2.19) and (2.20) in Theorem 2.18. Inequality (2.21) is given in Giné and Zinn 
(1986) assuming that T is countable. To allow T uncountable (but separable), 
a different hypothesis is used here. Sudakov (1973, Proposition 7) proved the
inequality in Theorem 2.22, assuming $D(\varepsilon, S, d) > 10$, and with the constant
$(1 - e^{-1})/2$ in place of 1/17 (for $D(\varepsilon, S, d) < 10$ it is easy to check that the
inequality must hold, possibly with a smaller constant).


Notes to Section 2.5. Sudakov (1971) stated that a set $K$ in Hilbert space is
GB if and only if its mixed volume $h_1(K)$ homogeneous of degree 1 is finite.
In my review of that paper in Mathematical Reviews I wrote that “Up to the
present, no such geometric criterion for the GB-property was known, so that the
theorem is of great interest.” Then Sudakov (1973) pointed out that finiteness
of $h_1(K)$ is equivalent to that of $E L(K)^*$. My unduly brief review of the 1973
paper in Math. Revs. did not even mention $E L(K)^*$ although it turned out to
appear much more often in later literature.


Notes to Section 2.6. Theorems 2.23 and 2.25 and Corollaries 2.26 and 2.27, 
and essentially Corollary 2.28, are due to T. W. Anderson (1955). These facts 
were to some extent rediscovered by L. Gross (1962). Both used the Brunn— 
Minkowski inequality in their proofs. See also Borell (1974, 1975a,b), Gordon 
(1985), and Kahane (1986). 


Notes to Section 2.7. Lemma 2.29 is essentially due to T. W. Anderson (1955). 
Lemmas 2.30 and 2.31 are straightforward. In Dudley (1967a, Theorem 4.6), 
parts of which are due to Gross (1962), most of the conditions in Theorem 2.32 
were proved equivalent for convex, symmetric sets. Feldman (1972) proved 
part (b’), about the symmetric convex hull of C. 


Notes to Section 2.8. In apparently the first public use of a metric entropy
hypothesis to obtain sample continuity of Gaussian processes, V. N. Sudakov,
in a talk at the International Congress of Mathematicians in Moscow in 1965,
announced that if $\log N(\varepsilon, C) = O(\varepsilon^{-r})$ as $\varepsilon \downarrow 0$ for some $r < 2$, then $C$ is a
GC-set. Sudakov (1969) was his first publication on the topic. Specific bounds
on metric entropy implying sample continuity were suggested by previous
work of Fernique (1964) for processes indexed by a real parameter. The fact
that $\int_0^1 (\log N(\varepsilon, C))^{1/2}\,d\varepsilon < \infty$ implies $C$ is a GC-set was given (with an
equivalent series instead of the integral) in Dudley (1967a). Theorem 2.36 and
its proof were given in Dudley (1973). Theorem 2.37 has often been attributed 
to me, but I did not prove it. I only gave the integral on the right side. Sudakov 
(1973) first gave the expectation on the left, but as mentioned in the Notes 
to Section 2.5, I did not for some time take note of it. One source for the 
formulation and a proof (with an extension) of Theorem 2.37 is Pisier (1983). 
The proof given above is based (with some changes) on Ledoux and Talagrand 
(1991, Section 11.1) where the Theorem is extended, with different functions 
go, to a large class of non-Gaussian stochastic processes. Komatsu’s inequality, 
used in the proof of Lemma 2.41, is quoted, with hints for the proof, in Itô and 
McKean (1974, p. 17), citing as original source Komatsu (1955). 


Notes to Section 2.9. As mentioned at the beginning of the section and at 
several points in it, the section is based on the book Ledoux (2001). Some 
details not given in the book are filled in. 


Notes to Section 2.10. This section is based on early parts of the book of 
Talagrand (2005). 


Notes to Section 2.11. Fernique (1975) proved that for Gaussian processes
satisfying a homogeneity condition like that in Corollary 2.53, the metric entropy
integral condition (or the corresponding condition on $N(\varepsilon, T)$, equivalent by
Theorem 1.9 above) is necessary and sufficient for sample continuity.


Notes to Section 2.12. I have no reference for Theorem 2.55. The facts in this
section on Gaussian processes were to some extent stated, but not proved, in
Dudley (1967a, 1973). Andersen and Dobrić (1988, Lemma 8, (4.6); preprint,
1985) first published proofs of Corollary 2.56 and Theorem 2.57 according to
Giné and Zinn (1986, p. 58).


Foundations of Uniform Central Limit
Theorems; Donsker Classes


3.1 Definitions: Convergence in Law 


Empirical processes $\sqrt{n}(P_n - P)$ as mentioned in Section 2.1 will be defined
here with more detail and precision than there. Let $(S, \mathcal{B}, P)$ be a probability
space, to be called the sample space. Examples to have in mind for $S$
are Euclidean spaces such as the plane $\mathbb{R}^2$. To form empirical measures, one
would like to take variables $X_1, X_2, \ldots$, i.i.d. with law $P$. To do this, take
a countable product $S^\infty$ of copies of $(S, \mathcal{B}, P)$ (RAP, Theorem 8.2.2) and let
$X_i$ be the coordinates on the product. A product may be taken with another
probability space. Throughout the rest of this book, $X_i$ will be defined as such
coordinates unless something is said to the contrary. An example showing
that use of coordinates on product spaces, rather than just having $X_1, X_2, \ldots$
i.i.d. ($P$), makes a difference will be given at the end of Section 5.3. Recall
that the product $\sigma$-algebra is the smallest for which all the coordinates are
measurable.

Then, we can form the empirical measures $P_n := n^{-1} \sum_{i=1}^{n} \delta_{X_i}$ and the
empirical process $\nu_n := n^{1/2}(P_n - P)$. So $P_n$ is a probability measure on $S$, defined
on $\mathcal{B}$ and actually on all subsets of $S$, for any values of $X_1, \ldots, X_n$. Each $\nu_n$ is
a finite signed measure of total charge 0.

Recall (Section 2.1) the Gaussian process $G_P(f)$ defined for $f \in
\mathcal{L}^2(S, \mathcal{B}, P)$: $G_P$ has mean 0 and covariance equal to the covariance for $P$
(2.2). Given $f \in \mathcal{L}^2(P)$, let $\pi_0(f) := f - \int f\,dP$. Then $\pi_0(f) \in \mathcal{L}^2(P)$.

A semi-inner product $(\cdot, \cdot)$ on $V \times V$ for a vector space $V$ satisfies the
definition of inner product except that possibly $(u, u) = 0$ for $u \ne 0$.

For $f, g \in \mathcal{L}^2(P)$, again recalling (2.2),

$$E(G_P(f) G_P(g)) = (f, g)_{0,P} := (\pi_0(f), \pi_0(g)),$$

where $(\cdot, \cdot)$ (without subscripts) is the usual semi-inner product on $\mathcal{L}^2(P)$,
$(f, g) := (f, g)_P := \int fg\,dP$, and $(\cdot, \cdot)_{0,P}$ is another semi-inner product.
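
As a concrete illustration of the empirical measure and empirical process evaluated at a single function, here is a minimal sketch. The function names `empirical_measure` and `empirical_process` are hypothetical, the sample is a fixed list rather than a random draw, and the value of the integral of f with respect to P is passed in by the caller (an assumption made to keep the example free of numerical integration).

```python
import math

def empirical_measure(sample, f):
    """P_n(f): the average of f over the observed points X_1, ..., X_n."""
    return sum(f(x) for x in sample) / len(sample)

def empirical_process(sample, f, integral_f):
    """nu_n(f) = sqrt(n) * (P_n(f) - P(f)), with P(f) supplied by the caller."""
    n = len(sample)
    return math.sqrt(n) * (empirical_measure(sample, f) - integral_f)

if __name__ == "__main__":
    sample = [0.1, 0.4, 0.45, 0.9]              # stand-in for X_1, ..., X_4
    indicator = lambda x: 1.0 if x <= 0.5 else 0.0
    # For P uniform on [0, 1], P(indicator) = 0.5.
    print(empirical_measure(sample, indicator))
    print(empirical_process(sample, indicator, 0.5))
    # Total charge 0: the constant function 1 always gives nu_n(1) = 0.
    print(empirical_process(sample, lambda x: 1.0, 1.0))
```

The last call reflects the statement that each empirical process is a signed measure of total charge 0.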

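Let $\mathcal{L}^2_0(P)$ be the set of all functions $f \in \mathcal{L}^2(P)$ such that $\int f\,dP = 0$.
Then $\pi_0$ is linear from $\mathcal{L}^2(P)$ onto $\mathcal{L}^2_0(P)$. Let $L^2_0(P)$ be the set of all
equivalence classes of elements of $\mathcal{L}^2_0(P)$ for equality a.s. ($P$). On $L^2_0(P)$,
$(\cdot, \cdot)_{0,P} = (\cdot, \cdot)_P$.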

Thus, restricted to $L^2_0(P)$, $G_P$ is an isonormal process. Let $C$ be the
one-dimensional space of constant functions $c$ as a subspace of $L^2(P)$. Then $G_P$
is 0 on $C$, while the spaces $C$ and $L^2_0(P)$ are orthogonal complements of each
other (RAP, Theorem 5.3.8) in $L^2(P)$. For any $f$ and $g$ in $\mathcal{L}^2(P)$ and $c \in \mathbb{R}$,
$G_P(cf + g) = c\,G_P(f) + G_P(g)$ a.s. (2.4).

Recall the notion of pseudometric as defined after (2.4). The covariance for
$P$ defines a pseudometric on $\mathcal{L}^2(P)$ by $\rho_P(f, g) := (E(G_P(f) - G_P(g))^2)^{1/2}$.
Since $f = g$ a.s. implies $G_P(f) = G_P(g)$ a.s., $\rho_P$ defines a pseudometric
on $L^2(P)$, which is the usual Hilbert metric on $L^2_0(P)$. Thus, the results of
Chapter 2 for the isonormal process $L$ apply to $G_P$ on $L^2_0(P)$, with inner
product $(\cdot, \cdot)_P = (\cdot, \cdot)_{0,P}$ there, and the Hilbert metric is $\rho_P$.

The Brownian bridge process $y_t$ is a special case of the $G_P$ process where
$P$ is Lebesgue measure on $[0, 1]$ and $y_t = G_P(1_{[0,t]})$; it is easily checked that
this has the right covariance.
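
The covariance check mentioned here can be carried out numerically. The sketch below assumes the covariance for P is P(fg) - P(f)P(g), takes P to be Lebesgue measure on [0, 1], and approximates integrals by a midpoint rule; the helper names `integrate01`, `cov_P`, and `indicator` are hypothetical. The result is compared with the Brownian bridge covariance min(s, t) - st.

```python
def integrate01(f, n_cells=10000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n_cells
    return sum(f((k + 0.5) * h) for k in range(n_cells)) * h

def cov_P(f, g, n_cells=10000):
    """Covariance for P = Lebesgue measure on [0, 1]: P(fg) - P(f) P(g)."""
    pf = integrate01(f, n_cells)
    pg = integrate01(g, n_cells)
    pfg = integrate01(lambda x: f(x) * g(x), n_cells)
    return pfg - pf * pg

def indicator(t):
    """Indicator function of the interval [0, t]."""
    return lambda x: 1.0 if 0.0 <= x <= t else 0.0

if __name__ == "__main__":
    s, t = 0.3, 0.7
    print(cov_P(indicator(s), indicator(t)))   # numerical covariance
    print(min(s, t) - s * t)                   # Brownian bridge covariance
```

For indicators of intervals whose endpoints lie on the grid, the midpoint rule is essentially exact, so the two printed values agree up to rounding.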

The Brownian bridge process can be taken to have continuous sample paths,
in other words, to be continuous as a function of $t$ for each $\omega$ (RAP, Theorem
12.1.5). But except in special cases, such as that the sample space $S$ is a
finite set, the spaces $L^2(P)$ and $L^2_0(P)$ are infinite-dimensional, in the sense
that they contain infinite orthonormal sets. We saw in Section 2.8 that an
isonormal process on an infinite-dimensional Hilbert space $H$ is not
sample-continuous. Then $G_P$ is not sample-continuous on the whole space $L^2(P)$. We
will be concerned then with suitable subsets of $\mathcal{L}^2(P)$. A class $\mathcal{F} \subset \mathcal{L}^2$ will
be called pregaussian if a $G_P$ process $(f, \omega) \mapsto G_P(f)(\omega)$ can be defined on
some probability space such that for each $\omega$, $f \mapsto G_P(f)(\omega)$ is bounded and
uniformly continuous for $\rho_P$ from $\mathcal{F}$ into $\mathbb{R}$.

Given $\mathcal{F} \subset \mathcal{L}^2(P)$, let $\pi_0(\mathcal{F})$ be the set of all functions $\pi_0(f)$, $f \in \mathcal{F}$.
For any $f \in \mathcal{L}^2(P)$, $\rho_P(f, \pi_0(f)) = 0$, and $G_P(f) = G_P(\pi_0(f))$ a.s. I claim
that $\mathcal{F}$ is pregaussian if and only if $\pi_0(\mathcal{F})$ is: if $\pi_0(\mathcal{F})$ is pregaussian, then
$f \mapsto G_P(\pi_0(f))$ has the desired properties on $\mathcal{F}$. Conversely if $\mathcal{F}$ is
pregaussian, take a probability space $(\Omega, \mathcal{A}, Q)$ and a $G_P$ process on $\mathcal{F}$ over this
probability space such that $G_P(\cdot)(\omega)$ is bounded and $\rho_P$-uniformly continuous
on $\mathcal{F}$ for all $\omega$. For each $g \in \pi_0(\mathcal{F})$ there is a nonempty set $C_{\mathcal{F}}(g)$ of constants
$c$ such that $g + c \in \mathcal{F}$. For any $c, d \in C_{\mathcal{F}}(g)$ and any $\omega$, $G_P(g + c)(\omega) =
G_P(g + d)(\omega)$ since $\rho_P(g + c, g + d) = 0$. Define $H_P(g)(\omega)$ as any such
$G_P(g + c)(\omega)$. Then $H_P$ has the desired properties of $G_P$ on $\pi_0(\mathcal{F})$, as
claimed.

Now recall the definition of GC-set from Subsection 2.2.1 above. We have
the following:


Theorem 3.1 Let $\mathcal{F} \subset \mathcal{L}^2(P)$. Let $\mathcal{F}'$ be the set of all equivalence classes in
$L^2_0(P)$ of functions in $\pi_0(\mathcal{F})$. Then $\mathcal{F}$ is pregaussian if and only if $\mathcal{F}'$ is a GC-set
in $L^2_0(P)$.


Proof. Let $L$ be the isonormal process on $L^2_0(P)$. The three stochastic processes
indexed by $\mathcal{F}$, $\{G_P(f) : f \in \mathcal{F}\}$, $\{G_P(\pi_0(f)) : f \in \mathcal{F}\}$ and $\{L(\pi_0(f)) : f \in
\mathcal{F}\}$, are equal in distribution. By definition a GC-set is totally bounded and $L$ on
it can be chosen with uniformly continuous, thus bounded sample functions.
So the “if” part follows. Conversely if $\mathcal{F}$ is pregaussian, $\mathcal{F}'$ must be a
GC-set.


Recall the notion of prelinear function (Lemma 2.30). A $G_P$ process $Y$ on a
class $\mathcal{F} \subset \mathcal{L}^2(S, \mathcal{B}, P)$ for a probability space $(S, \mathcal{B}, P)$ will be called coherent
if for each $\omega \in \Omega$, the function $f \mapsto Y(f)(\omega)$ on $\mathcal{F}$ is bounded, $\rho_P$-uniformly
continuous, and prelinear.


Theorem 3.2 Given a probability space $(S, \mathcal{B}, P)$ and $\mathcal{F} \subset \mathcal{L}^2(S, \mathcal{B}, P)$, $\mathcal{F}$ is
pregaussian if and only if there exists a coherent $G_P$ process on $\mathcal{F}$.


Proof. “If” follows from the definition of pregaussian. For the converse, apply 
Theorems 3.1 and 2.32. 


Now, if a class $\mathcal{F} \subset \mathcal{L}^2(P)$ is pregaussian we can ask whether $\nu_n$ converges
in distribution, or in law, to $G_P$ with respect to uniform convergence over $\mathcal{F}$.
Recall that for random variables $Y_n$ with values in a separable metric space $S$,
convergence in law of $Y_n$ to $Y_0$ is defined to mean that $E g(Y_n) \to E g(Y_0)$ as
$n \to \infty$ for every bounded continuous function $g$ on $S$. But empirical processes
take values in nonseparable metric spaces in general. Consider the following:


Example. Let $U$ be the $U[0, 1]$ distribution function as in Section 1.1 above.
Let $X_1$ have this distribution function and let $U_1$ be the empirical
distribution function, so that $U_1(x) = 1$ if $x \ge X_1$ and 0 otherwise. The
empirical process $\sqrt{n}(U_n - U)$ for $n = 1$ is just $U_1 - U$. For each possible value
$y$ of $X_1$ we get a function $G_y(x) = (U_1 - U)(x) = -x$ for $0 \le x < y$ and
$1 - x$ for $y \le x \le 1$ (equal to 0 outside $[0, 1]$). As in Section 1.1,
consider the supremum norm and distance, for a bounded real-valued function
$g$ on $[0, 1]$, $\|g\|_{\sup} := \sup_{0 \le x \le 1} |g(x)|$ and for two such functions $g$ and $h$,
$d_{\sup}(g, h) := \|g - h\|_{\sup}$. Let $\mathcal{G}_1$ be the set of all $G_y$ for $0 \le y \le 1$. For $y \ne y'$
in $[0, 1]$ and the corresponding $U_1$ and $U_1'$ we have

$$d_{\sup}(G_y, G_{y'}) = \|(U_1 - U) - (U_1' - U)\|_{\sup} = \|U_1 - U_1'\|_{\sup} = 1 \qquad (3.1)$$

since if, for example, $y' < y$, then for each $x$ with $y' \le x < y$ we have
$U_1'(x) = 1$ and $U_1(x) = 0$. So, in the metric $d_{\sup}$, $\mathcal{G}_1$ is a discrete, complete
(and thus closed in any set including it) set with any two points in it at distance
1 apart. Thus every subset $F$ of $\mathcal{G}_1$ is also complete and closed. There exists
a bounded continuous function $H$ on the space $\ell^\infty[0, 1]$ of all bounded real
functions on $[0, 1]$ which equals 1 on $F$ and 0 on $\mathcal{G}_1 \setminus F$, by the Tietze extension
theorem (such an $H$ can be defined explicitly as $H(g) := \max(0, 1 - d(F, g))$
where $d(F, g) := \inf_{G \in F} \|G - g\|_{\sup}$). If $A$ is a non-Lebesgue measurable
subset of $[0, 1]$, which exists assuming the axiom of choice (RAP, Theorem
3.4.4), and $F := \{G_y : y \in A\}$, then for $X_1$ having a $U[0, 1]$ distribution,
$E H(G_{X_1}) = \int_0^1 H(G_y)\,dy = \lambda(A)$, which is undefined, where $\lambda$ is Lebesgue
measure.
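
The distance computation (3.1) and the explicit function H can be checked on a grid. This is only a sketch under stated assumptions: the supremum over [0, 1] is replaced by a maximum over a finite grid (which must contain points between the two jump locations), the set F is replaced by a finite list of jump points, and the names `G`, `sup_distance`, and `H` are chosen for illustration.

```python
def G(y):
    """The function G_y = U_1 - U when X_1 = y: 1{x >= y} - x on [0, 1], 0 elsewhere."""
    def g(x):
        if x < 0.0 or x > 1.0:
            return 0.0
        return (1.0 if x >= y else 0.0) - x
    return g

def sup_distance(g, h, grid):
    """Grid approximation of the supremum distance between g and h."""
    return max(abs(g(x) - h(x)) for x in grid)

def H(g, jump_points, grid):
    """H(g) = max(0, 1 - d(F, g)) for F = {G_y : y in jump_points} (finite here)."""
    d = min(sup_distance(G(y), g, grid) for y in jump_points)
    return max(0.0, 1.0 - d)

if __name__ == "__main__":
    grid = [k / 1000 for k in range(1001)]
    print(sup_distance(G(0.25), G(0.75), grid))   # distance between two distinct G_y
    print(H(G(0.25), [0.25, 0.5], grid))          # G_{0.25} belongs to F
    print(H(G(0.75), [0.25, 0.5], grid))          # G_{0.75} does not
```

With a finite F nothing pathological happens; the nonmeasurability in the example comes only from choosing the index set A to be non-Lebesgue measurable.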

So, the definition of convergence in distribution, or weak convergence, of 
probability measures on the Borel sets of separable metric spaces, in terms 
of integrals of bounded continuous functions converging, does not apply 
to empirical processes, because the integrals of some bounded continuous 
functions do not exist. In a nonseparable metric space, we cannot necessarily 
expect to have distributions defined on all Borel sets. A new definition of con- 
vergence in law is needed to take care of this nonmeasurability. The definition 
will involve the notion of upper integral. Let $g$ be a real-valued, not necessarily
measurable function defined on a space $X$ where $(X, \mathcal{S}, \mu)$ is a measure space.
Let $\bar{\mathbb{R}}$ be the set $[-\infty, \infty]$ of extended real numbers. Then the upper integral is
defined by

$$\int^* g\,d\mu := \inf\left\{\int h\,d\mu : h \ge g,\ h \text{ measurable and } \bar{\mathbb{R}}\text{-valued}\right\},$$

which will be undefined if there exists a measurable $h \ge g$ with $\int h\,d\mu = \infty - \infty$
undefined, unless there is also a measurable $\psi \ge g$ with $\int \psi\,d\mu = -\infty$, in
which case $\int^* g\,d\mu$ will also be defined as $-\infty$. There always exists at least
one measurable $h \ge g$, namely $h \equiv +\infty$.
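
On a finite space the upper integral can be computed by hand, which may help fix ideas. The sketch below rests on an assumption not made in the text: the sigma-algebra is generated by a finite partition into atoms, so measurable functions are exactly those constant on atoms, and the smallest measurable majorant of g takes the maximum of g on each atom. The function names and the data layout (a list of `(points, mass)` pairs) are hypothetical.

```python
def upper_integral(g, atoms):
    """Upper integral of g when the sigma-algebra is generated by the atoms.

    g     : dict mapping each point to a real value (need not be measurable)
    atoms : list of (points, mass) pairs forming a partition of the space
    """
    return sum(mass * max(g[w] for w in points) for points, mass in atoms)

def lower_integral(g, atoms):
    """Lower integral, obtained as minus the upper integral of -g."""
    negated = {w: -v for w, v in g.items()}
    return -upper_integral(negated, atoms)

if __name__ == "__main__":
    atoms = [((0, 1), 0.5), ((2, 3), 0.5)]
    g = {0: 1.0, 1: 3.0, 2: 0.0, 3: 2.0}     # not constant on atoms
    print(upper_integral(g, atoms))          # uses the atom maxima 3 and 2
    print(lower_integral(g, atoms))          # uses the atom minima 1 and 0
```

The gap between the two values measures how far g is from being measurable for this coarse sigma-algebra; the two agree exactly when g is constant on every atom.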

We will be dealing often with compositions of functions. If f is a function 
whose domain includes the range of g, then either f(g) or f o g will denote 
the function such that (f o g)(x) = f(g(x)). 

A function, which may not be measurable, from a probability space into
a metric space will be called a random element. Now here is a definition of
convergence in law where only the limit variable $Y_0$ necessarily has a law:


Definition. Let $(S, d)$ be any metric space. Let $(\Omega_n, \mathcal{A}_n, Q_n)$ be probability
spaces for $n = 0, 1, 2, \ldots$, and $Y_n$, $n \ge 0$, functions from $\Omega_n$ into $S$. Suppose
that $Y_0$ takes values in some separable subset of $S$ and is measurable for the
Borel sets on its range. Then $Y_n$ will be said to converge to $Y_0$ in law as
$n \to \infty$, in symbols $Y_n \Rightarrow Y_0$, if for every bounded continuous real-valued function
$g$ on $S$,

$$\int^* g(Y_n)\,dQ_n \to \int g(Y_0)\,dQ_0 \quad \text{as } n \to \infty.$$

Note. If $\int G(Y_0)\,dQ_0$ is defined for all bounded continuous real $G$ (as it must
be if $Y_n \Rightarrow Y_0$, by definition), then the image measure $Q_0 \circ Y_0^{-1}$ is defined on
all Borel subsets of $S$ (RAP, Theorem 7.1.1). Such a law does have a
separable support except perhaps in some set-theoretically pathological cases
(Appendix C).


For $g$ bounded, $\int^* g(Y_n)\,dQ_n$ is always defined and finite. Then, here is a
general definition of when the central limit theorem for empirical measures
holds with respect to uniform convergence over a class $\mathcal{F}$ of functions. The
metric space $S$ will be the space $\ell^\infty(\mathcal{F})$ of all bounded real-valued functions on
$\mathcal{F}$, with metric given by the supremum norm $\|H\|_{\mathcal{F}} := \sup\{|H(f)| : f \in \mathcal{F}\}$.


Definition. Let $(\Omega, \mathcal{A}, P)$ be a probability space and $\mathcal{F} \subset \mathcal{L}^2(P)$. Then $\mathcal{F}$ will
be called a Donsker class for $P$, or $P$-Donsker class, or be said to satisfy the
central limit theorem (for empirical measures) for $P$, if $\mathcal{F}$ is pregaussian for $P$,
$G_P$ is coherent, and $\nu_n \Rightarrow G_P$ in $\ell^\infty(\mathcal{F})$.


Later, a number of rather large classes F of functions will be shown to be 
Donsker classes for various laws P. The next few sections develop some of the 
needed theory. 


3.2 Measurable Cover Functions 


In the last section, convergence in law was defined in terms of upper
integrals. The notion of upper integral is related to that of measurable cover.
Let $(\Omega, \mathcal{A}, P)$ be a probability space. Then for a possibly nonmeasurable
set $A \subset \Omega$, a set $B$ is called a measurable cover of $A$ if $A \subset B$, $B \in \mathcal{A}$, and
$P(B) = \inf\{P(C) : A \subset C,\ C \text{ measurable}\}$. If $B$ and $C$ are measurable covers
of the same set $A$, then clearly so is $B \cap C$. It follows that $B = C$ up to a set
of measure 0, in other words $P(B \,\triangle\, C) = 0$ where $\triangle$ denotes the symmetric
difference, or equivalently $P(1_B = 1_C) = 1$.

For any set $A \subset \Omega$ let $P^*(A) := \inf\{P(B) : A \subset B,\ B \text{ measurable}\}$. Then
for any measurable cover $B$ of $A$, clearly $P^*(A) = P(B)$.
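
A measurable cover and the outer probability are easy to exhibit when the sigma-algebra is generated by a finite partition. That restriction is an assumption of this sketch, as are the names `measurable_cover` and `outer_probability` and the `(points, mass)` representation of atoms: the cover is then the union of the atoms that meet A.

```python
def measurable_cover(A, atoms):
    """Union of all atoms that intersect A: a measurable cover of A."""
    cover = set()
    for points, _mass in atoms:
        if A & set(points):
            cover |= set(points)
    return cover

def outer_probability(A, atoms):
    """P*(A): total mass of the atoms that intersect A."""
    return sum(mass for points, mass in atoms if A & set(points))

if __name__ == "__main__":
    atoms = [({0, 1}, 0.5), ({2}, 0.3), ({3, 4}, 0.2)]
    A = {1, 3}                        # not a union of atoms, so not measurable
    print(measurable_cover(A, atoms))
    print(outer_probability(A, atoms))
```

Here A is not measurable because it splits two atoms, and its outer probability is the probability of the cover rather than a "mass of A" itself.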

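Let $\mathcal{L}^0 := \mathcal{L}^0(\Omega, \mathcal{A}, P, \bar{\mathbb{R}})$ denote the set of all measurable functions from
$\Omega$ into $\bar{\mathbb{R}}$. Then $\mathcal{L}^0$ is a lattice: for any $f, g \in \mathcal{L}^0$, $f \vee g := \max(f, g)$ and
$f \wedge g := \min(f, g)$ are in $\mathcal{L}^0$. But this $\mathcal{L}^0$ is not a vector space since we could
have, for example, $f \equiv +\infty$ and $g \equiv -\infty$, so $f + g$ would be undefined.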

The map $y \mapsto \tan^{-1} y$ is one-to-one from $\bar{\mathbb{R}}$ onto $[-\pi/2, \pi/2]$. Then a
metric on $\bar{\mathbb{R}}$ is defined from the usual metric on $[-\pi/2, \pi/2]$ by $\bar{d}(x, y) :=
|\tan^{-1} x - \tan^{-1} y|$. On $\mathcal{L}^0$ we have the Ky Fan metric (RAP, Theorem 9.2.2)
defined by

$$d(f, g) := \inf\{\varepsilon > 0 : P(\bar{d}(f(x), g(x)) > \varepsilon) \le \varepsilon\}.$$

Then $d(f, g) = 0$ if and only if $P(f = g) = 1$.

For any set 𝒥 ⊂ L⁰(Ω, 𝒜, P, R̄), a function f ∈ L⁰ is called an essential
infimum of 𝒥, or f := ess.inf 𝒥, iff for all j ∈ 𝒥, f ≤ j a.s. and for any
g ∈ L⁰ such that g ≤ j a.s. for all j ∈ 𝒥, we have g ≤ f a.s. If f and g
are two essential infima of the same set 𝒥, then clearly f = g a.s. A set 𝒥
of functions will be called a lower semilattice if for any f, g ∈ 𝒥, we have
min(f, g) ∈ 𝒥.


Theorem 3.3 For any probability space (Ω, 𝒜, P) and set 𝒥 ⊂ L⁰(Ω,
𝒜, P, R̄), an essential infimum of 𝒥 exists. If for some function f: Ω → R̄ we
have 𝒥 = {j ∈ L⁰: j ≥ f everywhere}, then f* := ess.inf 𝒥 can be chosen
so that f* ≥ f everywhere. Also, ∫ f* dP and E*f := ∫* f dP are both
defined and equal if either of them is well-defined (possibly infinite), for
example, if f* is bounded below.


Proof. Let 𝒥₁ be the class of all functions min(f_1, ..., f_m) for f_1, ..., f_m ∈ 𝒥
and m = 1, 2, .... Then 𝒥₁ is a lower semilattice. For f ∈ L⁰, f = ess.inf 𝒥
if and only if f = ess.inf 𝒥₁. So we can assume 𝒥 is a lower semilattice. For
each j ∈ 𝒥, tan⁻¹ j is a measurable function with values in [−π/2, π/2]. Take
j_m ∈ 𝒥 such that ∫ tan⁻¹ j_m dP ↓ inf_{j∈𝒥} ∫ tan⁻¹ j dP. Then min(j_1, ..., j_m) is
in 𝒥 and decreases to g as m → ∞ for some g ∈ L⁰(Ω, 𝒜, P, R̄). For any h ∈
𝒥, min(h, j_1, ..., j_m) ↓ min(h, g) so ∫ tan⁻¹ min(h, g) dP = ∫ tan⁻¹(g) dP and
g ≤ h a.s., so g satisfies the definition of ess.inf 𝒥. If 𝒥 = {h ∈ L⁰: h ≥ f},
then 𝒥 is a lower semilattice and for g constructed as above, g ≥ f everywhere.

By the definitions, ∫* f dP ≤ ∫ f* dP if either side is well-defined, and the
inequality is an equation by the definition of essential infimum.


Definition. For any f as in Theorem 3.3, f* will mean a function as shown to
exist in the theorem with f* ≥ f everywhere and will be called a measurable
cover function of f.
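
A small sketch may make f* and E*f concrete. It assumes, hypothetically, a
finite sample space whose σ-algebra is generated by a partition into atoms of
positive probability; then measurable functions are constant on atoms, so f* is
the maximum of f over each atom. The atoms, probabilities, and f are made up
for illustration.

```python
from fractions import Fraction

# Hypothetical partition model; every atom has positive probability.
ATOMS = [frozenset({"a", "b"}), frozenset({"c"})]
PROBS = [Fraction(1, 4), Fraction(3, 4)]

def cover_function(f):
    """f*: on each atom, the maximum of f over that atom."""
    star = {}
    for atom in ATOMS:
        top = max(f[x] for x in atom)
        for x in atom:
            star[x] = top
    return star

def upper_expectation(f):
    """E*f, computed as the integral of f* (Theorem 3.3)."""
    return sum(p * max(f[x] for x in atom) for atom, p in zip(ATOMS, PROBS))

f = {"a": 2, "b": 5, "c": 1}    # not measurable: not constant on {a, b}
f_star = cover_function(f)
e_star = upper_expectation(f)   # (1/4)*5 + (3/4)*1
```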


Recall that in Chapter 2, L(A)* was the essential supremum of L(x) for
x ∈ A, and so, the essential infimum of random variables Y such that for each
x ∈ A, Y ≥ L(x) a.s., a different, although related, notion.

If f is real valued and bounded above by some finite valued measurable
function, then f* is a measurable real-valued function. But whenever there
exist nonmeasurable sets A_n ↓ ∅ with P*(A_n) = 1, as for Lebesgue measure
(e.g. RAP, Section 3.4, Problem 2), let f := n on A_n \ A_{n+1}. Then f is real
valued but f* = +∞ a.s.

The next two lemmas on measurable cover functions are basic. 


Lemma 3.4 For any two functions f, g: Ω → (−∞, ∞], we have
(a) (f + g)* ≤ f* + g* a.s., and
(b) (f − g)* ≥ f* − g* whenever both sides are defined a.s.

Proof. (a) We have −∞ < f* ≤ +∞ and −∞ < g* ≤ +∞ everywhere,
so f* + g* is an everywhere defined, measurable function ≥ f + g, and (a)
follows. For part (b), on the measurable set where g* = +∞, on which by
assumption f* is finite a.s., the right side is −∞ and the inequality holds. Where
g* is finite, g is also finite and f = (f − g) + g, so f* ≤ (f − g)* + g* by
(a), so f* − g* ≤ (f − g)*, since this holds where (f − g)* < ∞ and where
(f − g)* = ∞.
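
The inequality in part (a) can be strict. The following sketch checks this in a
hypothetical finite model where the σ-algebra is generated by two atoms of
positive probability, so that the star of a function is its maximum over each
atom. The functions f and g are illustrative choices only.

```python
# Hypothetical partition model: stars are atom-wise maxima.
ATOMS = [frozenset({0, 1}), frozenset({2, 3})]

def cover_function(f):
    """f*: on each atom, the maximum of f over that atom."""
    star = {}
    for atom in ATOMS:
        top = max(f[x] for x in atom)
        for x in atom:
            star[x] = top
    return star

f = {0: 1, 1: 0, 2: 2, 3: 2}
g = {0: 0, 1: 1, 2: 1, 3: 3}

sum_star = cover_function({x: f[x] + g[x] for x in f})   # (f + g)*
f_star, g_star = cover_function(f), cover_function(g)
star_sum = {x: f_star[x] + g_star[x] for x in f}         # f* + g*
```

On the first atom f and g peak at different points, so (f + g)* = 1 < 2 =
f* + g* there, while on the second atom the two sides agree.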


Lemma 3.5 Let S be a vector space with a seminorm ||·||. Then for any two
functions X, Y from Ω into S, ||X + Y||* ≤ (||X|| + ||Y||)* ≤ ||X||* + ||Y||*
a.s. and ||cX||* = |c| ||X||* a.s. for any real c.


Proof. The first inequality is clear, the second follows from Lemma 3.4, and
the equation is clear (for c = 0 and c ≠ 0).


Next, in some cases of independence, the upper-star operation can be
distributed over products or sums.


Lemma 3.6 Let (Ω_j, 𝒜_j, P_j), j = 1, ..., n, be any n probability spaces. Let
f_j be functions from Ω_j into R. Suppose either

(a) f_j ≥ 0, j = 1, ..., n, or

(b) f_1 ≡ 1 and n = 2.

Then on the Cartesian product ∏_{j=1}^n (Ω_j, 𝒜_j, P_j) with x = (x_1, ..., x_n),
if f(x) := ∏_{j=1}^n f_j(x_j) we have f*(x) = ∏_{j=1}^n f_j*(x_j) a.s., where 0·∞ is set
equal to 0.

(c) Or, if f_j(x_j) > −∞ for all x_j, j = 1, ..., n, and g(x_1, ..., x_n) := f_1(x_1) +
··· + f_n(x_n), then g*(x_1, ..., x_n) = f_1*(x_1) + ··· + f_n*(x_n) a.s.


Proof. First, for (c), by induction we can assume n = 2. We have

    g*(u, v) ≥ g(u, v) = f_1(u) + f_2(v)    (3.2)

for all u, v by the definitions. We have g*(u, v) ≤ f_1*(u) + f_2*(v) a.s. by Lemma
3.4(a), and if equality does not hold a.s., there is a rational t such that on a
measurable set C of positive probability in the product space, g*(u, v) < t <
f_1*(u) + f_2*(v), and there exist rational q, r with q + r > t such that C can be
chosen with f_1*(u) > q and f_2*(v) > r for (u, v) ∈ C. Let C_u := {v: (u, v) ∈
C}. By the Tonelli-Fubini theorem there is a measurable set D ⊂ Ω_1 with
P_1(D) > 0 such that P_2(C_u) > 0 for all u ∈ D. If f_1 ≤ q on D, then f_1* ≤ q a.s.
on D, but for any u ∈ D and v ∈ C_u ≠ ∅ we have f_1*(u) > q, a contradiction.
So choose and fix a u ∈ D with f_1(u) > q. Then for any v ∈ C_u, q + f_2(v) <
f_1(u) + f_2(v) ≤ g*(u, v), so f_2(v) < g*(u, v) − q and f_2*(v) ≤ g*(u, v) − q
for almost all v ∈ C_u. For any such v, q + f_2*(v) < q + r and f_2*(v) < r, a
contradiction. So (c) is proved.

Now for products, in case (a) or (b), clearly f*(x) ≤ ∏_j f_j*(x_j) a.s., with
1* = 1. For the converse inequality we can assume n = 2, by induction in case
(a). Suppose f*(x) < f_1*(x_1) f_2*(x_2) with positive probability. Then for some
rational r, f*(x) < r < f_1*(x_1) f_2*(x_2) with positive probability. If f_1 ≡ 1 this
gives f(x) ≤ f*(x) < r < f_2*(x_2) on a set of positive probability. Then by the
Tonelli-Fubini theorem, for some x_1, f_2(x_2) ≤ f*(x_1, x_2) < r < f_2*(x_2) on a
set of x_2 with P_2 > 0, contradicting the choice of f_2*.

So assume f_1 ≥ 0 and f_2 ≥ 0. Then as in case (c), using the analogue of
(3.2) for products of nonnegative functions rather than sums, there are rationals
a, b with ab > r, a > 0, b > 0, such that on a set C in the product with
positive probability, f_1*(x_1) > a, f_2*(x_2) > b, and f*(x_1, x_2) < r. Again by
the Tonelli-Fubini theorem, there is a measurable set D ⊂ Ω_1 with P_1(D) > 0
and P_2(C_u) > 0, u ∈ D, and there is a point u of D where f_1(u) > a. Then for
any v ∈ C_u, f_2(v) ≤ f*(u, v)/a, so f_2*(v) ≤ f*(u, v)/a for almost all v ∈ C_u.
For such a v we have a f_2*(v) < ab and f_2*(v) < b, a contradiction, finishing
the proof.


For the next fact here is some notation: given two functions f, g and a
σ-algebra S on the range of f, let (f, g)(x) := (f(x), g(x)) and f⁻¹(S) :=
{f⁻¹(A): A ∈ S}.


Lemma 3.7 Let (Ω, 𝒜, P) = ∏_{i=1}^3 (Ω_i, S_i, P_i) with coordinate projections
Π_i(x_1, x_2, x_3) := x_i, i = 1, 2, 3. Let S_1 ⊗ S_2 denote the product σ-algebra
on Ω_1 × Ω_2. Then for any bounded real function f on Ω_1 × Ω_3 and
g(x_1, x_2, x_3) := f(x_1, x_3), conditional expectations of g* satisfy

    E(g* | (Π_1, Π_2)⁻¹(S_1 ⊗ S_2)) = E(g* | Π_1⁻¹(S_1)) a.s. for P.


Proof. By Lemma 3.6(b), for Ω_2 × (Ω_1 × Ω_3), g* equals P-a.s. a measurable
function not depending on x_2, thus independent of Π_2⁻¹(S_2). Let S be the
collection of all sets A ∈ (Π_1, Π_2)⁻¹(S_1 ⊗ S_2) such that g* and h := E(g* | Π_1⁻¹(S_1))
have the same integral over A. Then S contains all finite disjoint unions of sets
(Π_1, Π_2)⁻¹(B_1 × B_2) = Π_1⁻¹(B_1) ∩ Π_2⁻¹(B_2), B_i ∈ S_i, i = 1, 2, since both
g* and h are independent of Π_2⁻¹(S_2). Now S is easily seen to be a monotone
class, so it equals all of (Π_1, Π_2)⁻¹(S_1 ⊗ S_2) (RAP, Theorem 4.4.2).


Lemma 3.8 Let X be a real-valued function on a probability space (Ω, 𝒜, P).
Then for any t ∈ R,

(a) P*(X > t) = P(X* > t).
(b) For any ε > 0, P*(X ≥ t) ≤ P(X* ≥ t) ≤ P*(X ≥ t − ε).


Proof. Clearly {X > t} ⊂ {X* > t} and {X ≥ t} ⊂ {X* ≥ t}, so we have "≤"
in (a) and the first inequality in (b).

Take a measurable cover A ⊃ {X > t}, so P*(X > t) = P(A). Then
X* ≤ t a.s. outside A, so P(X* > t) ≤ P(A), proving (a). Thus for 0 < δ <
ε, P(X* > t − δ) ≤ P*(X ≥ t − ε). Letting δ ↓ 0 proves (b).
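
Part (a) can be checked directly in a toy case. The sketch below assumes a
hypothetical finite sample space with σ-algebra generated by three atoms of
positive probability, where X* is the atom-wise maximum of X and P*(X > t) is
the mass of the atoms containing a point with X > t. All numbers are
illustrative.

```python
from fractions import Fraction

# Hypothetical partition model with an X that is not constant on the first atom.
ATOMS = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4})]
PROBS = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
X = {0: 0, 1: 3, 2: 1, 3: 1, 4: 2}

def outer_prob_exceeds(t):
    """P*(X > t): mass of the atoms that contain some point with X > t."""
    return sum(p for atom, p in zip(ATOMS, PROBS) if any(X[w] > t for w in atom))

def prob_cover_exceeds(t):
    """P(X* > t), where X* is the maximum of X over each atom."""
    return sum(p for atom, p in zip(ATOMS, PROBS) if max(X[w] for w in atom) > t)

# Pairs (P*(X > t), P(X* > t)) for a few thresholds t.
checks = {t: (outer_prob_exceeds(t), prob_cover_exceeds(t)) for t in (0, 1, 2, 3)}
```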


Let (Ω, 𝒜, P) be a probability space. For a function f from Ω into [−∞, ∞]
let ∫_* f dP := sup{∫ g dP: g measurable, g ≤ f}. Let f_* be the essential
supremum of all measurable functions g ≤ f. Then just as for f*, f_* is
well-defined up to a.s. equality and ∫_* f dP = ∫ f_* dP whenever either side is
defined, as in Theorem 3.3.

It is easy to check that f_* = −((−f)*) and that ∫_* f dP = −(∫* −f dP). So
the convergence in law Y_n ⇒ Y_0 for functions into a metric space S as defined
in Section 3.1 implies that ∫_* g(Y_n) dQ_n → ∫ g(Y_0) dQ_0 for every bounded
continuous real-valued function g on S.


For the next fact, recall that if (S, 𝒮, μ) is a measure space, an atom of μ
is a point x ∈ S or the singleton {x} if {x} ∈ 𝒮 and μ({x}) > 0. The measure
μ will be called purely atomic if there is a countable set A ∈ 𝒮 such that each
x ∈ A is an atom and μ(S \ A) = 0. Next, here is a one-sided Tonelli-Fubini
theorem for starred functions:


Theorem 3.9 Let (X, 𝒜, P) × (Y, ℬ, Q) be a product of two probability
spaces. For a real-valued function f ≥ 0 on X × Y, define f* with respect
to P × Q. For each x ∈ X let (E_2* f)(x) := ∫* f(x, y) dQ(y). Then

    E_1* E_2* f(x, y) ≤ ∫ f*(x, y) d(P × Q)(x, y),    (3.3)

where E_1* := E* with respect to P. Also, if Q is purely atomic, with Σ_j Q({y_j})
= 1 for some y_j ∈ Y, and E_2(·) := ∫ · dQ, then

    E_1* E_2 f(X, Y) ≤ E* f(X, Y) = E_1 E_2 f*(X, Y).    (3.4)


Proof. By the usual Tonelli-Fubini theorem, we have

    ∫ f*(x, y) d(P × Q)(x, y) = ∫∫ f*(x, y) dQ(y) dP(x),    (3.5)

and ∫ f*(x, y) dQ(y) is measurable in x. For each x ∈ X let f_x(y) := f(x, y).
Then f_x(y) ≤ f*(x, y), which is measurable in y, so (f_x)*(y) ≤ f*(x, y) for
Q-almost all y. Thus since E* f = E f* by Theorem 3.3,

    (E_2* f)(x) = ∫ (f_x)*(y) dQ(y) ≤ ∫ f*(x, y) dQ(y),

and (3.3) follows. Next, to prove (3.4), note that on Y, since Q is purely
atomic, all real-valued functions are measurable (for the completion of Q), so
E_2* = E_2, the left inequality follows from (3.3), and the equality from (3.5),
so (3.4) is proved.


Remark. Under usual set-theoretic hypotheses, there exists an “ordinal triangle”
set A ⊂ I × I, I := [0, 1], such that for all x ∈ I, {y: (x, y) ∈ A} is countable,
and for all y ∈ I, {x: (x, y) ∈ A} has countable complement in I. (The set is
described at the beginning of Chapter 5.) Clearly ∫_0^1 ∫_0^1 1_A(x, y) dy dx = 0 <
1 = ∫_0^1 ∫_0^1 1_A(x, y) dx dy. The use of stars does not change this pathology: for
f = 1_A in (3.3), the left side is 0 and the right side is 1.


We also have a one-sided monotone convergence theorem with 
stars: 


Theorem 3.10 Let (Ω, 𝒜, P) be a probability space and let f_j be real-valued
functions on Ω such that f_j ↑ f, i.e. f_j(x) ↑ f(x) for all x ∈ Ω. If E* f_1 >
−∞, then E* f_j ↑ E* f as j → ∞.


Proof. By Theorem 3.3, E* f_j = E f_j* for each j, and f_j* increase a.s. up to
some function g where clearly g ≤ f* almost surely. If g < f* on some set
A with P(A) > 0, then f_j ≤ g on A implies f ≤ g and so f* ≤ g on A, a
contradiction. The theorem follows.


Note that there exist subsets A_j := A(j) of [0, 1] with outer measure
λ*(A_j) = 1 for all j and A_j ↓ ∅ (RAP, Problem 3.4.2). Letting f_j := 1_{A(j)} we
have f_j ↓ 0, and E* f_j = 1 for all j, so that the monotone convergence theorem
fails for E* for decreasing sequences. Next is a Fatou lemma with stars.


Theorem 3.11 Let (Ω, 𝒜, P) be a probability space and f_j any nonnegative
real-valued functions on Ω. Then E* liminf_{j→∞} f_j ≤ liminf_{j→∞} E* f_j.


Proof. For 1 ≤ k ≤ m we have inf_{n≥k} f_n ≤ f_m and so E* inf_{n≥k} f_n ≤
E* f_m, thus E* inf_{n≥k} f_n ≤ inf_{m≥k} E* f_m. Taking the supremum over k on
both sides and using the previous monotone convergence theorem gives the
result.


We need a notion of independence for random elements. Let (A_j, 𝒜_j, P_j),
j = 1, 2, ..., n, be probability spaces, and form a product ∏_{j=1}^n (A_j, 𝒜_j, P_j) =
(B, ℬ, P) with points x := {x_j}_{j=1}^n. If X_j are functions on B of the form
X_j = h_j(x_j), j = 1, ..., n, where each h_j is a function on A_j (not necessarily
measurable), then we call X_j independent random elements. If the h_j are
measurable, this implies independence in the usual sense.

Suppose given a vector space S with a seminorm ||·||. For a class F of
measurable functions on a sample space (X, ℬ), S may be the class of all
bounded functions on F with the supremum norm ||·||_F, or its subspace
of prelinear functions. Random elements X_j with values in S will be called
symmetric if we can write x_j = (y_j, z_j) where y_j and z_j are independent and
have the same distribution in some space D_j (A_j = D_j × D_j with P_j = Q_j ×
Q_j for some Q_j) and X_j = ψ(y_j) − ψ(z_j) for some function ψ from D_j
into S.

With these definitions, the Lévy inequality (Theorem 1.20) can be extended
to starred norms, with about the same proof:


Lemma 3.12 If X_j are independent symmetric random elements and S_j =
X_1 + ··· + X_j, then for any r > 0, we have

(a) Pr(max_{j≤n} ||S_j||* > r) ≤ 2 Pr(||S_n||* > r), and

(b) Pr(max_{j≤n} ||X_j||* > r) ≤ 2 Pr(||S_n||* > r).


Proof. For (a), let M_k(ω) := max_{j≤k} ||S_j||*. Let C_k be the disjoint events
{M_{k−1} ≤ r < ||S_k||*}, k = 1, 2, ..., where M_0 := 0. By Lemma 3.5,

    2||S_m||* ≤ ||S_n||* + ||2S_m − S_n||*,  1 ≤ m ≤ n,

so if ||S_m||* > r, then either ||S_n||* > r or ||2S_m − S_n||* > r or both. The
transformation which interchanges y_j and z_j for j > m preserves probabilities and
interchanges S_n and 2S_m − S_n, so interchanges ||S_n||* and ||2S_m − S_n||*, while
preserving all X_j for j ≤ m. Thus

    Pr(C_m ∩ {||S_n||* > r}) = Pr(C_m ∩ {||2S_m − S_n||* > r}) ≥ (1/2) Pr(C_m).

Thus Pr(M_n > r) = Σ_{m=1}^n Pr(C_m) ≤ 2 Pr(||S_n||* > r). For (b), the proof
is similar: replace S_i for i = j, k or m by X_i, and use the transformation
interchanging y_j and z_j for j ≠ m, which does not change any ||X_j||* or ||S_n||*.
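
For measurable real-valued summands the stars can be dropped and part (a)
becomes the classical Lévy inequality, which can be checked by exact
enumeration. The sketch below is only an illustration under assumptions not in
the text: y_j and z_j are taken to be independent fair 0-1 variables, ψ is the
identity, so X_j = y_j − z_j, and the norm is the absolute value.

```python
from fractions import Fraction
from itertools import product

def levy_check(n, r):
    """Return (Pr(max_j |S_j| > r), Pr(|S_n| > r)) for X_j = y_j - z_j."""
    total = max_exceeds = last_exceeds = 0
    for y in product((0, 1), repeat=n):
        for z in product((0, 1), repeat=n):
            partial, running_max = 0, 0
            for yj, zj in zip(y, z):
                partial += yj - zj                      # S_j
                running_max = max(running_max, abs(partial))
            total += 1
            max_exceeds += running_max > r
            last_exceeds += abs(partial) > r
    return Fraction(max_exceeds, total), Fraction(last_exceeds, total)

# All 4**4 equally likely outcomes of (y, z) are enumerated for n = 4.
results = {r: levy_check(4, r) for r in range(4)}
```

Each entry of `results` pairs the left side of (a) with Pr(|S_n| > r), so the
inequality says the first component is at most twice the second.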


3.3 Convergence Almost Uniformly and in Outer Probability 


In Section 3.1, the definition of convergence of laws was adapted to define 
convergence in law for random elements which may not have laws defined. In 
this section the same will be done for convergence in probability and almost 
sure convergence. 

Let (Ω, 𝒜, Q) be a probability space, (S, d) a metric space, and f_n functions
from Ω into S. Then f_n will be said to converge to f_0 in outer probability if
d(f_n, f_0)* → 0 in probability as n → ∞, or equivalently, by Lemma 3.8, for
every ε > 0, Q*{d(f_n, f_0) > ε} → 0 as n → ∞.

Also, f_n is said to converge to f_0 almost uniformly if as n → ∞,
d(f_n, f_0)* → 0 almost surely.

The following is immediate: 


Proposition 3.13 Almost uniform convergence always implies convergence in 
outer probability. 
If f_n are all measurable functions, then clearly convergence in outer
probability is equivalent to the usual convergence in probability, and almost
uniform convergence to almost sure convergence.

Now some definitions will be given for Glivenko—Cantelli properties, which 
are laws of large numbers for empirical measures. 


Definition. If (X, 𝒜, P) is a probability space and F is a class of integrable
real-valued functions, F ⊂ L¹(X, 𝒜, P), then F will be called a strong (resp.
weak) Glivenko-Cantelli class for P iff as n → ∞, ||P_n − P||_F → 0 almost
uniformly (resp. in outer probability).


In the following Proposition, part (C) is what is usually called “Egorov’s the- 
orem” for almost surely convergent sequences of measurable functions (RAP, 
Theorem 7.5.1). 


Proposition 3.14 Let (Ω, 𝒜, Q) be a probability space, (S, d) a metric space,
and f_n any functions from Ω into S for n = 0, 1, .... Then the following are
equivalent:

(A) f_n → f_0 almost uniformly.

(B) For any ε > 0, Q*{sup_{n≥m} d(f_n, f_0) > ε} ↓ 0 as m → ∞.

(C) For any δ > 0 there is some B ∈ 𝒜 with Q(B) > 1 − δ such that f_n → f_0
uniformly on B.

(D) There exist measurable h_n ≥ d(f_n, f_0) with h_n → 0 a.s.


Proof. (A) implies that

    (sup_{n≥m} d(f_n, f_0))* ≤ sup_{n≥m} (d(f_n, f_0)*) ↓ 0 a.s. as m → ∞,

which implies (B).

Assuming (B), for k = 1, 2, ..., let C_k := {sup_{n≥m(k)} d(f_n, f_0) > 1/k},
where m(k) is large enough so that Q*(C_k) < 2^{−k}. Take measurable covers
B_k for C_k, so C_k ⊂ B_k, B_k ∈ 𝒜 and Q(B_k) < 2^{−k}. For r = 1, 2, ..., let
A_r := Ω \ ∪_{k>r} B_k. Then Q(A_r) > 1 − 2^{−r} and f_n → f_0 uniformly on A_r, so
(C) holds.

Now assume (C). Take C_k ∈ 𝒜, k = 1, 2, ..., such that Q(C_k) ↑ 1 and
f_n → f_0 uniformly on C_k. We can take C_1 ⊂ C_2 ⊂ ···. Take m_k such that
d(f_n, f_0) < 1/k on C_k for all n ≥ m_k. Then d(f_n, f_0)* ≤ 1/k on C_k, so
d(f_n, f_0)* → 0 a.s., proving (A). Clearly, (A) and (D) are equivalent.


Example. In [0, 1] with Lebesgue measure P let A_1 ⊃ A_2 ⊃ ··· be sets with
P*(A_n) = 1 and ∩_{n=1}^∞ A_n = ∅ (e.g. RAP, Section 3.4, Problem 2; Cohn, 1980,
p. 35). Then 1_{A_n} → 0 everywhere and, in that sense, almost surely, but not
almost uniformly. Note also that 1_{A_n} does not converge to 0 in law as defined in
Section 3.1. To avoid such pathology, almost uniform convergence is helpful.


Proposition 3.15 Let (S, d) and (Y, e) be two metric spaces and (Ω, 𝒜, Q)
a probability space. Let f_n be functions from Ω into S for n = 1, 2, ..., such
that f_n → f_0 in outer probability as n → ∞. Assume that f_0 has separable
range and is measurable (for the Borel σ-algebra on S). Let g be a continuous
function from S into Y. Then g(f_n) → g(f_0) in outer probability.


Proof. Given ε > 0, k = 1, 2, ..., let B_k := {x ∈ S: d(x, y) < 1/k implies
e(g(x), g(y)) ≤ ε, y ∈ S}. Then each B_k is closed and B_k ↑ S as k → ∞. Fix
k large enough so that Q(f_0⁻¹(B_k)) > 1 − ε. Then

    {e(g(f_n), g(f_0)) > ε} ∩ f_0⁻¹(B_k) ⊂ {d(f_n, f_0) ≥ 1/k}.

Thus

    Q*{e(g(f_n), g(f_0)) > ε} ≤ ε + Q*{d(f_n, f_0) ≥ 1/k} ≤ 2ε

for n large enough.


Lemma 3.16 Let (Ω, 𝒜, Q) be a probability space and {g_n}_{n=0}^∞ a uniformly
bounded sequence of real-valued functions on Ω such that g_0 is measurable. If
g_n → g_0 in outer probability, then limsup_{n→∞} ∫* g_n dQ ≤ ∫ g_0 dQ.


Proof. Let |g_n(x)| ≤ M < ∞ for all n and all x ∈ Ω. We can assume M =
1. Given ε > 0, for n large enough Q*(|g_n − g_0| > ε) < ε. Let A_n be a
measurable set on which |g_n − g_0| ≤ ε with Q(Ω \ A_n) < ε. Then

    ∫* g_n dQ ≤ ε + ∫*_{A_n} g_n dQ ≤ 2ε + ∫_{A_n} g_0 dQ ≤ 3ε + ∫ g_0 dQ.

Letting ε ↓ 0 completes the proof.


On any metric space, the o-algebra will be the Borel o-algebra unless 
something is said to the contrary. 


Corollary 3.17 If f_n are functions from a probability space into a metric space,
f_n → f_0 in outer probability and f_0 is measurable with separable range, then
f_n ⇒ f_0.


Proof. Apply Proposition 3.15 to g = G for any bounded continuous G and
Lemma 3.16 to g_n := G ∘ f_n and to g_n := −G ∘ f_n.


3.4 Perfect Functions 


For a function g defined on a set A let g[A] := {g(x): x ∈ A}. It will be useful
that under some conditions on a measurable function g and general real-valued
f, (f ∘ g)* = f* ∘ g. Here are some equivalent conditions:


Theorem 3.18 Let (X, 𝒜, P) be a probability space, (Y, ℬ) any measurable
space, and g a measurable function from X to Y. Let Q be the restriction of
P ∘ g⁻¹ to ℬ. For any real-valued function f on Y, define f* for Q. Then the
following are equivalent:

(a) For any A ∈ 𝒜 there is a B ∈ ℬ with B ⊂ g[A] and Q(B) ≥ P(A);

(b) For any A ∈ 𝒜 with P(A) > 0 there is a B ∈ ℬ with B ⊂ g[A] and
Q(B) > 0;

(c) For every real function f on Y, (f ∘ g)* = f* ∘ g a.s.;

(d) For any D ⊂ Y, (1_D ∘ g)* = (1_D)* ∘ g a.s.


Proof. Clearly (a) implies (b). To show (b) implies (c), note that always
(f ∘ g)* ≤ f* ∘ g. Suppose (f ∘ g)* < f* ∘ g on a set of positive probability.
Then for some rational r, (f ∘ g)* < r < f* ∘ g on a set A ∈ 𝒜 with P(A) >
0. Let g[A] ⊃ B ∈ ℬ with Q(B) > 0. Then f ∘ g < r on A implies f < r on
B, so f* ≤ r on B a.s., contradicting f* ∘ g > r on A.

Clearly (c) implies (d).

Now, to show (d) implies (a), given A ∈ 𝒜, let D := Y \ g[A]. Then we can
take (1_D)* = 1_C for some C ∈ ℬ: let C be the set where (1_D)* ≥ 1. Then D ⊂ C
and (1_D)* ∘ g = (1_D ∘ g)* = 0 a.s. on A. Let B := Y \ C. Then B ⊂ g[A], and

    Q(B) = 1 − Q(C) = 1 − ∫ (1_D)* d(P ∘ g⁻¹),

which by the image measure change of variables theorem (e.g. RAP, Theorem
4.1.11) equals

    1 − ∫ (1_D)* ∘ g dP = 1 − ∫ (1_D ∘ g)* dP ≥ P(A).


Note. In (a) or (b), if the direct image g[A] ∈ ℬ, we could just set B := g[A].
But, for any uncountable complete separable metric space Y, there exists a
complete separable metric space S (for example, a countable product N^∞ of
copies of N) and a continuous function f from S into Y such that f[B] is not a
Borel set in Y (RAP, Theorem 13.2.1, Proposition 13.2.5). If f is only required
to be Borel measurable, then S can also be any uncountable complete metric
space (RAP, Theorem 13.1.1).

A function g satisfying any of the four conditions in Theorem 3.18 will be 
called perfect or P-perfect. Coordinate projections on a product space are, as 
one would hope, perfect: 


Proposition 3.19 Suppose A = X × Y, P is a product probability ν × m, and
g is the natural projection of A onto Y. Then g is P-perfect.


Proof. Here P ∘ g⁻¹ = m. For any B ⊂ A let B_y := {x: (x, y) ∈ B}, y ∈ Y.
If B is measurable, then by the Tonelli-Fubini theorem, for C := {y: ν(B_y) >
0}, C is measurable, C ⊂ g[B], and P(B) ≤ m(C), so condition (a) of Theorem
3.18 holds.


Theorem 3.20 Let (Ω, 𝒜, P) be a probability space and (S, d) a metric space.
Suppose that for n = 0, 1, ..., (Y_n, ℬ_n) is a measurable space, g_n a perfect
measurable function from Ω into Y_n, and f_n a function from Y_n into S, where f_0
has separable range and is measurable. Let Q_n := P ∘ g_n⁻¹ on ℬ_n and suppose
f_n ∘ g_n → f_0 ∘ g_0 in outer probability as n → ∞. Then f_n ⇒ f_0 as n → ∞
for f_n on (Y_n, ℬ_n, Q_n).

Before proving this, here is an example:


Proposition 3.21 Theorem 3.20 can fail without the hypothesis that g_n be
perfect.


Proof. Let C ⊂ I := [0, 1] satisfy 0 = λ_*(C) < λ*(C) = 1 for Lebesgue
measure λ (RAP, Theorem 3.4.4). Let P = λ*, giving a probability measure on the
Borel sets of C (RAP, Theorem 3.3.6). Let Ω = C, f_0 ≡ 0, Y_n = I, f_n := 1_{I\C}
for n ≥ 1, and let g_n be the identity from C into Y_n for all n. Then f_n ∘ g_n ≡ 0
for all n, so f_n ∘ g_n → f_0 ∘ g_0 in outer probability (and in any other sense). Let
ℬ_n be the Borel σ-algebra on Y_n = I for each n. Let G be the identity from I
into R. Then ∫* G(f_n) dQ_n = ∫* f_n dλ = 1 for n ≥ 1, while ∫ G(f_0) dQ_0 = 0,
so f_n does not converge to f_0 in law.


After Theorem 3.20 is proved, it will follow that the g_n in the last proof are
not perfect, as can also be seen directly, from condition (c) or (d) in Theorem
3.18.


Proof of Theorem 3.20. By Corollary 3.17, f_n ∘ g_n ⇒ f_0 ∘ g_0. Let H be any
bounded, continuous, real-valued function on S. Then by an image measure
change of variables (RAP, Theorem 4.1.11),

    ∫* H(f_n(g_n)) dP → ∫ H(f_0(g_0)) dP = ∫ H(f_0) dQ_0.

Also,

    ∫* H(f_n(g_n)) dP = ∫ (H ∘ f_n ∘ g_n)* dP     by Theorem 3.3
                      = ∫ (H ∘ f_n)*(g_n) dP      by Theorem 3.18
                      = ∫ (H ∘ f_n)* dQ_n         (RAP, Theorem 4.1.11)
                      = ∫* H(f_n) dQ_n            by Theorem 3.3,

and the Theorem follows.


In Proposition 3.19, X × Y could be an arbitrary product probability space,
but projection is a rather special function. The following fact will show that all
measurable functions on reasonable domain spaces are perfect.

Recall that a law P is called tight if sup{P(K): K compact} = 1. A set
𝒫 of laws is called uniformly tight if for every ε > 0 there is a compact K
such that P(K) > 1 − ε for all P ∈ 𝒫. Also, a metric space (S, d) is called
universally measurable (u.m.) if for every law P on the completion of S, S is
measurable for the completion of P (RAP, Section 11.5). So any metric space
which is a Borel subset of its completion is u.m.


Theorem 3.22 Let (S, d) be a u.m. separable metric space. Let P be a
probability measure on the Borel σ-algebra of S. Then any Borel measurable
function g from S into a separable metric space Y is perfect for P.


Note. In view of Appendix C, the hypothesis that Y be separable is not very
restrictive.


Proof. Let A be any Borel set in S with P(A) > 0. Let 0 < ε < P(A). By the
extended Lusin theorem (Theorem D.1, Appendix D) there is a closed set F with
P(F) > 1 − ε/2 such that g restricted to F is continuous. Since P is tight (RAP,
Theorems 11.5.1 and 7.1.3), there is a compact set K ⊂ A with P(K) > ε. Then
C := F ∩ K is compact, C ⊂ A, P(C) > 0, and g is continuous on C, so g[C]
is compact, g[C] ⊂ g[A], and (P ∘ g⁻¹)(g[C]) ≥ P(C) > 0, so the conclusion
follows from Theorem 3.18.


Let (Ω, 𝒜, P) be a probability space and g a measurable function from Ω
into Y where (Y, ℬ) is a measurable space. Let Q := P ∘ g⁻¹ on ℬ. Call g
quasiperfect for P or P-quasiperfect if for every C ⊂ Y with g⁻¹(C) ∈ 𝒜, C is
measurable for the completion of Q. Then the probability space (Ω, 𝒜, P) is
called perfect if every real-valued function G on Ω, measurable for the usual
Borel σ-algebra on R, is quasiperfect.


Example. A measurable, quasiperfect function g on a finite set need not be
perfect: let X = {a_1, a_2, a_3, a_4, a_5, a_6}, U := {a_1, a_2}, V := {a_3, a_4}, W :=
{a_5, a_6}, 𝒜 := {∅, U, V, W, U ∪ V, U ∪ W, V ∪ W, X}, P(U) = P(V) =
P(W) = 1/3, Y := {0, 1, 2}, g(a_1) := g(a_3) := 0, g(a_2) := g(a_5) := 1,
g(a_4) := g(a_6) := 2. Let ℬ := {∅, Y}. For C ⊂ Y, g⁻¹(C) ∈ 𝒜 if and only
if C ∈ ℬ, so g is quasiperfect. But, P(U) > 0 and g[U] does not include any
nonempty set in ℬ, so g is not perfect.
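
Since everything in this example is finite, it can be verified by brute force.
The sketch below assumes the assignment g(a_1) = g(a_3) = 0, g(a_2) = g(a_5) = 1,
g(a_4) = g(a_6) = 2; the point names and helper functions are ad hoc. It lists
the sets C ⊂ Y whose preimage lies in the σ-algebra on X, and tests condition
(b) of Theorem 3.18 for A = U.

```python
from itertools import combinations

X = ["a1", "a2", "a3", "a4", "a5", "a6"]
U, V, W = frozenset({"a1", "a2"}), frozenset({"a3", "a4"}), frozenset({"a5", "a6"})
SIGMA_A = {frozenset(), U, V, W, U | V, U | W, V | W, U | V | W}

# Assumed values of g (see the lead-in above).
g = {"a1": 0, "a3": 0, "a2": 1, "a5": 1, "a4": 2, "a6": 2}
Y = [0, 1, 2]
SIGMA_B = {frozenset(), frozenset(Y)}   # trivial sigma-algebra on Y

def subsets(points):
    """All subsets of a finite list of points."""
    return [frozenset(c) for k in range(len(points) + 1)
            for c in combinations(points, k)]

def preimage(C):
    return frozenset(x for x in X if g[x] in C)

def image(A):
    return frozenset(g[x] for x in A)

# Quasiperfect: every C with measurable preimage is already in SIGMA_B.
pulled_back = {C for C in subsets(Y) if preimage(C) in SIGMA_A}
quasiperfect = pulled_back <= SIGMA_B

# Condition (b) for A = U needs a B in SIGMA_B inside g[U] with Q(B) > 0;
# the only set in SIGMA_B of positive Q-measure is Y itself.
condition_b_for_U = any(B and B <= image(U) for B in SIGMA_B)
```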


Proposition 3.23 Any perfect function is quasiperfect. 


Proof. Let C ⊂ Y, A := g⁻¹(C) ∈ 𝒜. By Theorem 3.18 take B ⊂ g[A] with
B ∈ ℬ, Q(B) ≥ P(A). Then B ⊂ C, so Q(B) = P(g⁻¹(B)) ≤ P(g⁻¹(C)) =
P(A), and Q(B) = P(A). Thus the inner measure Q_*(C) = P ∘ g⁻¹(C).
Likewise, Q_*(Y \ C) = (P ∘ g⁻¹)(Y \ C), so Q*(C) = (P ∘ g⁻¹)(C) and C
is Q-completion measurable.


3.5 Almost Surely Convergent Realizations 


First recall a theorem of Skorohod (RAP, Theorem 11.7.2): if (S, d) is a
complete separable metric space, and P_n are laws on S converging to a law P_0, then
on some probability space there exist S-valued measurable functions X_n such
that ℒ(X_n) = P_n for all n and X_n → X_0 almost surely. This section will prove
an extension of Skorohod’s theorem to our current setup.

Having almost uniformly convergent realizations shows that the definition
of convergence in law for random elements is reasonable and will be useful in
some later proofs on convergence in law.

Suppose f_n ⇒ f_0 where f_n are random elements, in other words functions
not necessarily measurable except for n = 0, defined on some probability
spaces (Ω_n, Q_n) into a possibly nonseparable metric space S. We want to find
random elements Y_n “having the same laws” as f_n for each n such that Y_n → Y_0
almost surely or better, almost uniformly. At first look it is not clear what
“having the same laws” should mean for random elements f_n, n ≥ 1, not having
laws defined on any nontrivial σ-algebra. A way that turns out to work is to
define Y_n = f_n ∘ g_n where g_n are functions from some other probability space
Ω with probability measure Q into Ω_n such that each g_n is measurable and
Q ∘ g_n⁻¹ = Q_n for each n. Thus the argument of f_n will have the same law Q_n
as before. It turns out moreover that the g_n should be not only measurable but
perfect.

Before stating the theorem, here is an example to show that there may really
be no way to define a σ-algebra on S on which laws could be defined and yield
an equivalence as in the next theorem, even if S is a finite set.


Example. Let (X_n, 𝒜_n, Q_n) = ([0, 1], ℬ, λ) for all n (λ = Lebesgue measure,
ℬ = Borel σ-algebra). Take sets C(n) ⊂ [0, 1] with 0 = λ_*(C(n)) <
λ*(C(n)) = 1/n² (RAP, Theorem 3.4.4). Let S be the two-point space {0, 1}
with usual metric. Then f_n := 1_{C(n)} → 0 in law and almost uniformly, but each
“law” β_n := Q_n ∘ f_n⁻¹ is only defined on the trivial σ-algebra {∅, S}. The only
larger σ-algebra on S is 2^S, but no β_n for n ≥ 1 is defined on 2^S.


Theorem 3.24 Let (S, d) be any metric space, (X_n, 𝒜_n, Q_n) any probability
spaces, and f_n a function from X_n into S for each n = 0, 1, .... Suppose f_0 has
separable range S_0 and is measurable (for the Borel σ-algebra on S_0). Then
f_n ⇒ f_0 if and only if there exists a probability space (Ω, 𝒮, Q) and perfect
measurable functions g_n from (Ω, 𝒮) to (X_n, 𝒜_n) for each n = 0, 1, ..., such
that Q ∘ g_n⁻¹ = Q_n on 𝒜_n for each n and f_n ∘ g_n → f_0 ∘ g_0 almost uniformly
as n → ∞.


Notes. Proposition 3.21 and the “if and only if” in Theorem 3.24 show that the
hypothesis that g_n be perfect cannot just be dropped from the Theorem.


Proof. “If” follows from Proposition 3.13 and Theorem 3.20. “Only if” will
be proved very much as in RAP (Theorem 11.7.2).

Let Ω be the Cartesian product ∏_{n=0}^∞ X_n × I_n, where each I_n is a copy of
[0, 1]. Here g_n will be the natural projection of Ω onto X_n for each n.

Let P := Q_0 ∘ f_0⁻¹ on the Borel σ-algebra of S, concentrated in the
separable subset S_0. A set B ⊂ S will be called a continuity set (RAP, Section 11.1)
for P if P(∂B) = 0 where ∂B is the boundary of B. Then,


Lemma 3.25 For any ε > 0 there are disjoint open continuity sets U_j, j =
1, ..., J, for some J < ∞, where for each j, diam U_j := sup{d(x, y): x, y ∈
U_j} < ε, and with Σ_{j=1}^J P(U_j) > 1 − ε.


Proof. Let {x_j}_{j=1}^∞ be dense in S_0. Let B(x, r) := {y ∈ S: d(x, y) < r} for
0 < r < ∞ and x ∈ S_0. Then B(x_j, r) is a continuity set of P for all but at
most countably many values of r. Choose r_j with ε/3 < r_j < ε/2 such that
B(x_j, r_j) is a continuity set of P for each j. The continuity sets form an algebra
(RAP, Proposition 11.1.4). Let

    U_j := B(x_j, r_j) \ ∪_{i<j} {y: d(x_i, y) ≤ r_i}.

Then U_j are disjoint open continuity sets of diameters < ε with Σ_{j=1}^∞ P(U_j) =
1, so there is a J < ∞ with Σ_{j=1}^J P(U_j) > 1 − ε.


Now to continue the proof of Theorem 3.24, for each k = 1, 2, ..., by
the last Lemma take disjoint open continuity sets U_{kj} := U(k, j) of P for
j = 1, 2, ..., J_k := J(k) < ∞, with diam(U_{kj}) < 1/k, P(U_{kj}) > 0, and

    Σ_{j=1}^{J(k)} P(U_{kj}) > 1 − 2^{−k}.    (3.6)

For any open set U in S with complement F, let d(x, F) := inf{d(x, y): y ∈
F}. For r = 1, 2, ..., let F_r := {x: d(x, F) ≥ 1/r}. Then F_r is closed and
F_r ↑ U as r → ∞. There is a continuous h_r on S with 0 ≤ h_r ≤ 1, h_r = 1 on
F_r and h_r = 0 outside F_{2r}: let h_r(x) := min(1, max(0, 2r d(x, F) − 1)).
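
To see how h_r interpolates between 0 and 1, here is a minimal sketch for one
hypothetical case: S is the real line with its usual metric and U is the open
interval (0, 1), so that d(x, F) = max(0, min(x, 1 − x)). The choice of U and
the function names are assumptions made only for this illustration.

```python
from fractions import Fraction

def dist_to_complement(x, a=Fraction(0), b=Fraction(1)):
    """d(x, F) for U = (a, b) on the real line, F the complement of U."""
    return max(Fraction(0), min(x - a, b - x))

def h(r, x):
    """h_r(x) = min(1, max(0, 2 r d(x, F) - 1))."""
    return min(Fraction(1), max(Fraction(0), 2 * r * dist_to_complement(x) - 1))

r = 4
inside = h(r, Fraction(1, 4))     # d(x, F) = 1/r, so x is in F_r
edge = h(r, Fraction(1, 8))       # d(x, F) = 1/(2r), boundary of F_{2r}
between = h(r, Fraction(3, 16))   # strictly between the two levels
outside = h(r, Fraction(2))       # x lies in F, so d(x, F) = 0
```

With r = 4 the function equals 1 once d(x, F) ≥ 1/4, vanishes for
d(x, F) ≤ 1/8, and is linear in d(x, F) in between.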

For each j and k, let F(k, j) := S \ U_{kj}. Take r := r(k, j) large enough so
that P(F(k, j)_r) > (1 − 2^{−k}) P(U_{kj}). Let h_{kj} be the h_r as defined above for
such an r and H_{kj} the h_{2r}.
For n large enough, say for n ≥ n_k, we have

    ∫_* h_{kj}(f_n) dQ_n > (1 − 2^{−k}) P(U_{kj}) and ∫* H_{kj}(f_n) dQ_n < (1 + 2^{−k}) P(U_{kj})

for all j = 1, ..., J_k. We may assume n_1 < n_2 < ···.
For every n = 0, 1, ..., let f_{kjn} := (h_{kj} ∘ f_n)_* for Q_n, so that by Theorem
3.3, since h_{kj} ≥ 0, we can assume that f_{kjn} ≥ 0 everywhere,


1 iti f pO, 02 2, 


and f_{kjn} is 𝒜_n-measurable. For n ≥ 1 let B_{kjn} := {f_{kjn} > 0} ∈ 𝒜_n. Let B_{kj0} :=
f_0⁻¹(U_{kj}) ∈ 𝒜_0. For each k and n, the B_{kjn} ⊂ f_n⁻¹(U_{kj}) are disjoint for j =
1, ..., J_k, and H_{kj}(f_n) = 1 on B_{kjn}, so for n ≥ n_k,

    (1 − 2^{−k}) P(U_{kj}) < Q_n(B_{kjn}) < (1 + 2^{−k}) P(U_{kj}), and Q_0(B_{kj0}) = P(U_{kj}).    (3.7)

Let T_n := X_n × I_n. Let μ_n be the product law Q_n × λ on T_n where λ is Lebesgue
measure on the Borel σ-algebra ℬ in I_n. For each k ≥ 1, n ≥ n_k, and j =
1, ..., J_k, let

    C_{kjn} := B_{kjn} × [0, F(k, j, n)] ⊂ T_n,
    D_{kjn} := B_{kj0} × [0, G(k, j, n)] ⊂ T_0,

where F and G are defined so that

    μ_n(C_{kjn}) = μ_0(D_{kjn}) = min(Q_n(B_{kjn}), Q_0(B_{kj0})).

Then for each k, j, and n ≥ n_k, we have by (3.7), since 1/(1 + 2^{−k}) ≥ 1 − 2^{−k},

    1 − 2^{−k} < min(F, G)(k, j, n) ≤ max(F, G)(k, j, n) = 1.    (3.8)
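
The arithmetic behind (3.8) is short enough to check with exact fractions.
Since μ_n(C_{kjn}) = Q_n(B_{kjn}) F and μ_0(D_{kjn}) = Q_0(B_{kj0}) G, the defining
condition forces F = min(q_n, q_0)/q_n and G = min(q_n, q_0)/q_0, where q_n and q_0
abbreviate the two probabilities. The numerical values below are hypothetical,
chosen only to satisfy (3.7).

```python
from fractions import Fraction

def heights(q_n, q_0):
    """Return (F, G) with q_n * F == q_0 * G == min(q_n, q_0)."""
    m = min(q_n, q_0)
    return m / q_n, m / q_0

k = 2
p_u = Fraction(1, 5)                           # hypothetical P(U_kj)
q_0 = p_u                                      # Q_0(B_kj0) = P(U_kj), as in (3.7)
q_n = (1 + Fraction(1, 2 ** (k + 1))) * p_u    # a value permitted by (3.7)

F, G = heights(q_n, q_0)
lower_bound = 1 - Fraction(1, 2 ** k)          # 1 - 2^{-k}
```

In this instance F = 8/9 and G = 1, so the smaller height stays above
1 − 2^{−k} = 3/4 and the larger one equals 1, as (3.8) asserts.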


Let 
Jk) Jk) 
Cron = Ta \ (J Cops. Dron = To \ (J Dijn- 
j=1 j=1 
For k = 0 let Jo := J (0) := 0, Coon := Ta, Doon := To, and no := O. 
For each n = 1,2, ..., let k(n) be the unique k such that ng < n < ng41. 


Then for $n \ge 1$, $T_n$ is the disjoint union of sets $W_{nj} := C_{k(n)jn}$,
$j = 0, 1, \dots, J_{k(n)}$. We also have

$$\text{if } j \ge 1 \text{ and } (v, s) \in W_{nj}, \text{ then } v \in B_{k(n)jn}, \text{ so } f_n(v) \in U_{k(n)j}. \tag{3.9}$$

Next, for each n, $T_0$ is the disjoint union of sets $E_{nj} := D_{k(n)jn}$ for
$j = 0, 1, \dots, J_{k(n)}$. Then $\mu_n(W_{nj}) = \mu_0(E_{nj})$ for each n and j, and if $j \ge 1$ or
k(n) = 0, then $\mu_0(E_{nj}) > 0$.

For x in $T_0$ and each n,

$$\text{let } j(n, x) \text{ be the } j \text{ such that } x \in E_{nj}. \tag{3.10}$$

Let $L := \{x \in T_0 : \mu_0(E_{n j(n,x)}) > 0 \text{ for all } n\}$. Then $T_0 \setminus L \subset \bigcup_i E_{n(i)0}$ for
some (possibly empty or finite) sequence n(i) such that $\mu_0(E_{n(i)0}) = 0$ for all
i. Thus $\mu_0(L) = 1$.

For $x \in L$ and any measurable set $B \subset T_n$, in other words B is in the product
σ-algebra $\mathcal{A}_n \otimes \mathcal{B}$, recall that $\mu_n(W_{nj}) = \mu_0(E_{nj})$ and let

$$P_{nj}(B) := \mu_n(B \cap W_{nj})/\mu_0(E_{nj}), \qquad P_{nx} := P_{n j(n,x)}. \tag{3.11}$$

Then $P_{nx}$ is a probability measure on $\mathcal{A}_n \otimes \mathcal{B}$. Let $p_x$ be the product measure
$\prod_{n=1}^{\infty} P_{nx}$ on $T := \prod_{n=1}^{\infty} T_n$ (RAP, Theorem 8.2.2).


Lemma 3.26 For any measurable set $H \subset T$ (for the infinite product σ-algebra
with $\mathcal{A}_n \otimes \mathcal{B}$ on each factor), $x \mapsto p_x(H)$ is measurable on $(T_0, \mathcal{A}_0 \otimes \mathcal{B})$.


Proof. Let $\mathcal{H}$ be the collection of all H for which the assertion holds. Given
n, $P_{nx}$ is one of finitely many laws, each obtained for x in a measurable subset
$E_{nj}$. Thus if $Y_n$ is the natural projection of T onto $T_n$ and $H = Y_n^{-1}(B)$ for
some $B \in \mathcal{A}_n \otimes \mathcal{B}$ then $H \in \mathcal{H}$.

If $H = \bigcap_{i \in M} Y_{m(i)}^{-1}(B_i)$ where $B_i \in \mathcal{A}_{m(i)} \otimes \mathcal{B}$ and M is finite, we may
assume the m(i) are distinct. Then

$$p_x(H) = \prod_{i \in M} P_{m(i)x}(B_i),$$

so $H \in \mathcal{H}$. Then, any finite, disjoint union of such intersections is in $\mathcal{H}$. Such
unions form an algebra. If $H_n \in \mathcal{H}$ and $H_n \uparrow H$ or $H_n \downarrow H$, then $H \in \mathcal{H}$. As the
smallest monotone class containing an algebra is a σ-algebra (RAP, Theorem
4.4.2), the Lemma follows.


Now returning to the proof of Theorem 3.24, $\Omega = T_0 \times T$. For any product
measurable set $C \subset \Omega$ and $x \in T_0$, let $C_x := \{y \in T : (x, y) \in C\}$, and
$Q(C) := \int p_x(C_x)\,d\mu_0(x)$. Here $x \mapsto p_x(C_x)$ is measurable if C is a finite
union of products $A_i \times F_i$ where $A_i \in \mathcal{A}_0 \otimes \mathcal{B}$ and $F_i$ is product measurable
in T. Such a union equals a disjoint union (RAP, Proposition 3.2.2). Thus by
monotone classes again, $x \mapsto p_x(C_x)$ is measurable on $T_0$ for any product
measurable set $C \subset \Omega$. Thus Q is defined. It is then clearly a countably additive
probability measure.

Let p be the natural projection of $T_n$ onto $X_n$. Recall that $P_{nx} = P_{nj}$ for
all $x \in E_{nj}$, by (3.10) and (3.11). The marginal of Q on $X_n$, in other words
$Q \circ g_n^{-1}$, is by (3.11) again

$$\sum_{j=0}^{J(k(n))} \mu_0(E_{nj})\, P_{nj} \circ p^{-1} = \mu_n \circ p^{-1} = Q_n.$$

Thus Q has marginal $Q_n$ on $X_n$ for each n as desired.

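By (3.6), $\sum_{k=1}^{\infty} Q_0\bigl(X_0 \setminus \bigcup_{j=1}^{J(k)} f_0^{-1}(U_{kj})\bigr) \le \sum_{k=1}^{\infty} 2^{-k} < \infty$. So $Q_0$-almost
every $y \in X_0$ belongs to $\bigcup_{j=1}^{J(k)} f_0^{-1}(U_{kj})$ for all large enough k. Also if $t \in I_0$
and $t < 1$, then by (3.8), $t < G(k, j, n)$ for all $j \ge 1$ as soon as $1 - 2^{-k} > t$ and
$n \ge n_k$. Thus for $\mu_0$-almost all (y, t), there is an m such that $(y, t) \in \bigcup_{j=1}^{J(k(n))} E_{nj}$
for all $n \ge m$. If $x := (y, t) \in E_{nj}$ for $j \ge 1$, then $y \in B_{k(n)j0}$, so $f_0(y) \in U_{k(n)j}$.
Also, by (3.11), $P_{nx} = P_{nj}$ is concentrated in $W_{nj}$. For $(v, s) \in W_{nj}$,
$f_n(v) \in U_{k(n)j}$ by (3.9). Since $\operatorname{diam}(U_{kj}) < 1/k$ for each $j \ge 1$,

$$Q^*\bigl(d(f_n(g_n), f_0(g_0)) > 1/k(n) \text{ for some } n \ge m\bigr) \le \mu_0\bigl(\{(y, t) \in E_{n0} \text{ for some } n \ge m\}\bigr) \to 0$$

as $m \to \infty$, so $f_n(g_n) \to f_0(g_0)$ almost uniformly.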

Lastly, it will be shown that the $g_n$ are perfect. Suppose Q(A) > 0
for some A. Now $Q(A) = \int p_x(A_x)\,d\mu_0(x)$. First let $n \ge 1$. Then for some
x, $p_x(A_x) > 0$. If $\mu_0(E_{n0}) = 0$, we take $x \notin E_{n0}$. Now $T = T_n \times T^{(n)}$ where
$T^{(n)} = \prod_{1 \le i \ne n} T_i$. Then, on T, $p_x = P_{nx} \times Q_{nx}$ for a law $Q_{nx} = \prod_{m \ne n} P_{mx}$
on $T^{(n)}$. Let $A(x) := A_x$. By the Tonelli-Fubini theorem,

$$p_x(A_x) = \int\!\!\int 1_{A(x)}(u, v)\,dP_{nx}(u)\,dQ_{nx}(v).$$

Thus for some v, $\int 1_{A(x)}(u, v)\,dP_{nx}(u) > 0$. Choose and fix such a v as well as
x. Now $P_{nx} = P_{nj}$ for j = j(n, x) with $\mu_0(E_{nj}) > 0$. Let u = (s, t), $s \in X_n$,
and $t \in I_n$. Then since $P_{nj}$ is $Q_n \times \lambda$ restricted to a set of positive measure and
normalized,

$$0 < \int\!\!\int 1_{A(x)}(s, t, v)\,dQ_n(s)\,dt.$$

Choose and fix a t with

$$0 < \int 1_{A(x)}(s, t, v)\,dQ_n(s).$$

Let $C := \{s \in X_n : (s, t, v) \in A_x\}$. Then $Q_n(C) > 0$. Clearly $C \subset g_n[A]$, so
$g_n$ is perfect for $n \ge 1$ by Theorem 3.18.

To show $g_0$ is perfect, we have $\mu_0 = Q_0 \times \lambda$, and $p_x(A_x) > 0$ for x = (y, t)
in a set X with $\mu_0(X) > 0$. There is a $t \in I_0$ such that $Q_0(C) > 0$ where
$C := \{y : (y, t) \in X\}$. Then $C \subset g_0[A]$, so $g_0$ is perfect, finishing the proof of
Theorem 3.24.


3.6 Conditions Equivalent to Convergence in Law


Conditions equivalent to convergence of laws on separable metric spaces are 
given in the portmanteau theorem (RAP, Theorem 11.1.1) and metrization 
theorem (RAP, Theorem 11.3.3). Here, the conditions will be extended to 
general random elements for the theory being developed in this chapter. 

For any probability space $(\Omega, \mathcal{A}, Q)$ and real-valued function f on $\Omega$ let
$E^* f := \int^* f\,dQ$, $E_* f := \int_* f\,dQ$. If (S, d) is a metric space and f is a
real-valued function on S, recall (RAP, Section 11.2) that the Lipschitz seminorm
of f is defined by

$$\|f\|_L := \sup\{|f(x) - f(y)|/d(x, y) : x \ne y\},$$

and f is called a Lipschitz function if $\|f\|_L < \infty$. The bounded Lipschitz norm
is defined by $\|f\|_{BL} := \|f\|_L + \|f\|_{\sup}$ where $\|f\|_{\sup} := \sup_x |f(x)|$. Then
f is called a bounded Lipschitz function if $\|f\|_{BL} < \infty$, and $\|\cdot\|_{BL}$ is a norm
on the space of all such functions.
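To make the two norms concrete, the following Python sketch evaluates the Lipschitz seminorm and the bounded Lipschitz norm of a tent function restricted to a finite grid of real numbers. The grid, the tent function, and the helper names are illustrative assumptions; on a finite set of points the computed value is in general only a lower bound for the norm over the whole space.

```python
from itertools import combinations


def lipschitz_seminorm(f, points):
    """Largest difference quotient |f(x) - f(y)| / |x - y| over distinct pairs of points."""
    return max(abs(f(x) - f(y)) / abs(x - y) for x, y in combinations(points, 2))


def bl_norm(f, points):
    """Lipschitz seminorm plus sup norm, both restricted to the finite set `points`."""
    return lipschitz_seminorm(f, points) + max(abs(f(x)) for x in points)


def tent(x, eps=0.5):
    """max(0, 1 - |x| / eps): equal to 1 at 0 and to 0 at distance >= eps from 0."""
    return max(0.0, 1.0 - abs(x) / eps)


grid = [i / 8 for i in range(-8, 9)]   # dyadic points in [-1, 1]
norm_on_grid = bl_norm(tent, grid)
```

For this tent function with eps = 1/2 the slope is 1/eps = 2 and the supremum is 1, so the grid value agrees with the exact norm 1 + 1/eps = 3.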

The extended portmanteau theorem about to be proved is an adaptation of 
RAP, Theorem 11.1.1 and some further facts based on the last section (Theorem 
3.24). The proof to be given includes relatively easy implications, some of which 
consist of putting in stars at appropriate places in the proofs in RAP. 


Theorem 3.27 For any metric space (S, d) and n = 0, 1, ..., let $(X_n, \mathcal{A}_n, Q_n)$
be a probability space and $f_n$ a function from $X_n$ into S. Suppose $f_0$ has
separable range $S_0$ and is measurable. Let $P := Q_0 \circ f_0^{-1}$ on S. Then the
following are equivalent:

(a) $f_n \Rightarrow f_0$;

(a') $\limsup_{n \to \infty} E^* G(f_n) \le EG(f_0)$ for all bounded continuous real-valued
G on S;

(b) $E^* G(f_n) \to EG(f_0)$ as $n \to \infty$ for every bounded Lipschitz function G
on S;

(b') (a') holds for all bounded Lipschitz G on S;

(c) $\sup\{|E^* G(f_n) - EG(f_0)| : \|G\|_{BL} \le 1\} \to 0$ as $n \to \infty$;

(d) For any closed $F \subset S$, $P(F) \ge \limsup_{n \to \infty} Q_n^*(f_n \in F)$;

(e) For any open $U \subset S$, $P(U) \le \liminf_{n \to \infty} (Q_n)_*(f_n \in U)$;

(f) For any continuity set A of P in S, $Q_n^*(f_n \in A) \to P(A)$ and
$(Q_n)_*(f_n \in A) \to P(A)$ as $n \to \infty$;

(g) There exist a probability space $(\Omega, \mathcal{S}, Q)$ and measurable functions $g_n$ from
$\Omega$ into $X_n$ and $h_n$ from $\Omega$ into S such that the $g_n$ are perfect, $Q \circ g_n^{-1} = Q_n$
and $Q \circ h_n^{-1} = P$ for all n, and $d(f_n \circ g_n, h_n) \to 0$ almost uniformly.

Moreover, (g) remains equivalent if any of the following changes are made
in it: "almost uniformly" can be replaced by "in outer probability"; we can
take $h_n = f_0 \circ y_n$ for some measurable functions $y_n$ from $\Omega$ into $X_0$, which can
be assumed to be perfect; and we can take the $y_n$ to be all the same, $y_n = y_1$ for
all n.


Proof. Clearly (a) implies (a'). Conversely, interchanging G and −G, (a')
implies

$$\liminf_{n \to \infty} E^* G(f_n) \ge \liminf_{n \to \infty} E_* G(f_n) \ge EG(f_0),$$

and (a) follows, so (a) and (a') are equivalent.

Clearly (a) implies (b), which is equivalent to (b') just as (a) is to (a'). To
show (b) implies (c), let T be the completion of S. Then all the $f_n$ take values
in T. Each bounded Lipschitz function G on S extends uniquely to such a
function on T, and the functions $G \circ f_n$ on $X_n$ are exactly the same. So we can
assume in this step that S and $S_0$ are complete.

Let $\varepsilon > 0$. By Ulam's theorem (RAP, Theorem 7.1.4), take a compact $K \subset S_0$
with $P(K) = Q_0(f_0 \in K) > 1 - \varepsilon$. Recall that $d(x, K) := \inf\{d(x, y) : y \in K\}$
and $K^{\varepsilon} := \{x : d(x, K) < \varepsilon\}$. Let $g(x) := \max(0, 1 - d(x, K)/\varepsilon)$ for
$x \in S$. Then g is a bounded Lipschitz function with $\|g\|_{BL} \le 1 + 1/\varepsilon < \infty$.
Clearly $1_K \le g \le 1_{K^{\varepsilon}}$. Since $E_* g(f_n) \to Eg(f_0)$ as $n \to \infty$, we have for n
large enough

$$(Q_n)_*(f_n \in K^{\varepsilon}) \ge E_* g(f_n) > Eg(f_0) - \varepsilon > 1 - 2\varepsilon. \tag{3.12}$$


Let B be the set of all G on S with $\|G\|_{BL} \le 1$. Then the functions in B are
uniformly equicontinuous, so the set of restrictions of functions in B to K is
totally bounded for the supremum distance over K by the Arzelà-Ascoli theorem
(RAP, Theorem 2.4.7). Let $G_1, \dots, G_J$ for some finite J be functions in B
such that for each $G \in B$, $\sup_{x \in K} |(G - G_j)(x)| < \varepsilon$ for some $j = 1, \dots, J$.
Next, for any $G \in B$, choose such a j. Then

$$|E^* G(f_n) - EG(f_0)| \le |E^* G(f_n) - E^* G_j(f_n)| + |E^*(G_j(f_n)) - E(G_j(f_0))| + |E(G_j(f_0) - G(f_0))|, \tag{3.13}$$

a sum of three terms. For the last term, splitting the integral into two parts
according as $f_0 \in K$ or not, the first part is bounded by $\varepsilon$ and the second by $2\varepsilon$
since $|G_j - G| \le 2$ everywhere and $Q_0(f_0 \notin K) < \varepsilon$.

The middle term on the right side of (3.13) is bounded above by

$$\max_{1 \le j \le J} |E^* G_j(f_n) - EG_j(f_0)|,$$

which is less than $\varepsilon$ for n large enough by (b).

The first term on the right in (3.13) is a sum of two parts, one over a
measurable subset $A_n$ of $X_n$ where $f_n \in K^{\varepsilon}$ with $Q_n(A_n) > 1 - 2\varepsilon$ by (3.12),
and the other over the complement of $A_n$. Since $G, G_j \in B$ and $|G - G_j| < \varepsilon$
on K, we have $|G - G_j| < 3\varepsilon$ on $K^{\varepsilon}$. It follows by Lemma 3.4(a) that on
$A_n$, $(G \circ f_n)^* \le (G_j \circ f_n)^* + 3\varepsilon$, and likewise $(G_j \circ f_n)^* \le (G \circ f_n)^* + 3\varepsilon$, so
$|(G \circ f_n)^* - (G_j \circ f_n)^*| \le 3\varepsilon$ on $A_n$, and the first part is bounded above by
$3\varepsilon$. The second part is bounded by $2(1 - Q_n(A_n)) < 4\varepsilon$. So for n large enough
the left side of (3.13) is less than $11\varepsilon$ uniformly for $G \in B$ and (c) follows.

Clearly (c) implies (b), so (b) and (c) are equivalent. Next, (b) implies (d)
by taking bounded Lipschitz functions $g_k$ decreasing down to $1_F$ as $k \to \infty$,
specifically the functions g in the proof of (b) implies (c) with F in place of K
and $\varepsilon = 1/k$.

Next, (d) and (e) are equivalent by taking complements; (d) and (e) together
easily imply (f).

For any bounded continuous function G, all but countably many of the
sets {G < t} are continuity sets of P. It follows that G can be approximated
uniformly within $\varepsilon$ by a simple function $\sum_i t_i 1_{A_i}$ where the $A_i$ are continuity sets
$\{t_{i-1} \le G < t_i\}$. So (f) implies (a), and (a) through (f) are all equivalent.

Theorem 3.24 says that (a) implies (g), in the strongest form where
$h_n = h_1 = f_0 \circ y_1$ and $y_1$ is perfect.

The weakest form of (g), with convergence in outer probability and $h_n$
depending on n and not necessarily a composition of $f_0$ with any (perfect)
function, will be shown to imply (b'). Given $\varepsilon > 0$, take n large enough so that

$$Q^*\{d(f_n \circ g_n, h_n) > \varepsilon\} < \varepsilon.$$

So on a measurable set $A_n$ with $Q(A_n) > 1 - \varepsilon$, we have $d(f_n \circ g_n, h_n) \le \varepsilon$,
and so for any $G \in B$ (in other words $\|G\|_{BL} \le 1$) we have

$$|G(f_n(g_n)) - G(h_n)| \le \varepsilon 1_{A_n} + 2 \cdot 1_{A_n^c},$$

so

$$E^* G(f_n(g_n)) \le EG(h_n) + 3\varepsilon = EG(f_0) + 3\varepsilon$$

(since $h_n$ has the same distribution as $f_0$) and

$$E^* G(f_n(g_n)) = E(((G \circ f_n) \circ g_n)^*) = E((G \circ f_n)^* \circ g_n)$$

by Theorem 3.3 and since $g_n$ is perfect, and where all the expectations are with
respect to Q. The latter integral by the image measure theorem (RAP, Theorem
4.1.11) equals $E^*(G(f_n))$ (for $Q_n$). Letting $\varepsilon \downarrow 0$, (b') follows, so all of the forms
of (g) are equivalent to any of (a) through (f).


Theorem 1.7, in a special case, proved a form of convergence which in that
case is now easily seen to imply the conditions in Theorem 3.27: the one in (c),
for example.

It will be shown that convergence in law is also equivalent to convergence
in some analogues of the Prohorov and dual-bounded-Lipschitz metrics which
metrize convergence of laws on separable metric spaces as shown in RAP,
Theorem 11.3.3. We will have analogues of metrics, rather than actual metrics,
because of the non-symmetry between the nonmeasurable random elements $f_n$
and limiting measurable random variable $f_0$.


Definitions. Let $(X_m, \mathcal{A}_m, Q_m)$ be probability spaces, m = 0, 1, and (S, d) a
metric space. Let $f_m$ be functions from $X_m$ into S, m = 0, 1, such that $f_0$ is
measurable and has separable range. Let $P := Q_0 \circ f_0^{-1}$. Then let

$$\beta(f_1, f_0) := \sup\{|E^* G(f_1) - EG(f_0)| : \|G\|_{BL} \le 1\}.$$

Define the extended Prohorov distance $\rho(f_1, f_0)$ as the infimum of all $\varepsilon > 0$
such that $P(F) \le (Q_1)_*(f_1 \in F^{\varepsilon}) + \varepsilon$ for every nonempty closed set $F \subset S$.


In the following theorem, $\beta(f_m, f_0)$ and $\rho(f_m, f_0)$ are defined for any
m = 1, 2, ..., just as for m = 1.


Theorem 3.28 For any metric space (S, d), probability spaces $(X_m, \mathcal{A}_m, Q_m)$,
m = 0, 1, 2, ..., and functions $f_m$ from $X_m$ into S, where $f_0$ has separable
range and is measurable, the following are equivalent:


(i) $f_m \Rightarrow f_0$;
(ii) $\beta(f_m, f_0) \to 0$ as $m \to \infty$;
(iii) $\rho(f_m, f_0) \to 0$ as $m \to \infty$.


Proof. (i) and (ii) are equivalent as (a) and (c) in Theorem 3.27. To show that 
(ii) implies (iii), we have: 


Lemma 3.29 For any $f_1$ and $f_0$ as in the statement of Theorem 3.28,

$$\rho(f_1, f_0) \le \min(1, (2\beta(f_1, f_0))^{1/2}).$$


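Proof. Let F be any closed set in S and $P = Q_0 \circ f_0^{-1}$. Given $0 < \varepsilon \le 1$,
let $g(x) := \max(0, 1 - d(x, F)/\varepsilon)$ as before. Then $\|g\|_{BL} \le 1 + 1/\varepsilon$,
and $(Q_1)_*(f_1 \in F^{\varepsilon}) \ge \int_* g(f_1)\,dQ_1 \ge \int g(f_0)\,dQ_0 - \beta(f_1, f_0)(1 + 1/\varepsilon) \ge P(F) - \varepsilon$
if $\beta(f_1, f_0) \le \varepsilon^2/2$. Clearly $\rho(f_1, f_0) \le 1$ in all cases, so for
$\varepsilon := (2\beta(f_1, f_0))^{1/2}$ when that is $\le 1$, the Lemma follows.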
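The arithmetic behind the last step of the proof of Lemma 3.29 can be checked numerically. The sketch below verifies on a grid that, with $\varepsilon = (2\beta)^{1/2} \le 1$, the penalty $\beta(1 + 1/\varepsilon)$ never exceeds $\varepsilon$. The function names and the grid of β values are assumptions chosen only for this illustration.

```python
import math


def prohorov_upper_bound(beta: float) -> float:
    """The bound min(1, sqrt(2 * beta)) of Lemma 3.29 for a given beta >= 0."""
    return min(1.0, math.sqrt(2.0 * beta))


def key_step_slack(beta: float) -> float:
    """eps - beta * (1 + 1/eps) with eps = sqrt(2 * beta); nonnegative when eps <= 1."""
    eps = math.sqrt(2.0 * beta)
    return eps - beta * (1.0 + 1.0 / eps)


betas = [k / 1000 for k in range(1, 501)]   # 0 < beta <= 1/2, so eps <= 1
slacks = [key_step_slack(b) for b in betas]
```

Algebraically the slack equals $\varepsilon(1 - \varepsilon)/2$, which is why it vanishes exactly at $\beta = 1/2$ and why the hypothesis $\varepsilon \le 1$ is needed.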


Now to continue the proof of Theorem 3.28, applying the Lemma to $f_m$ in
place of $f_1$ for each m shows that (ii) implies (iii). To show that (iii) implies
(i), let U be an open set in S. For k = 1, 2, ..., let $F_k := \{x : d(x, y) < 1/k \text{ implies } y \in U\}$.
Then the $F_k$ are closed sets and $F_k \uparrow U$ as $k \to \infty$. For n
large enough, $\rho(f_n, f_0) < 1/k$ and then
$P(F_k) \le (Q_n)_*(f_n \in F_k^{1/k}) + 1/k \le (Q_n)_*(f_n \in U) + 1/k$, so

$$P(F_k) \le \liminf_{n \to \infty} (Q_n)_*(f_n \in U) + \frac{1}{k}.$$

Then letting $k \to \infty$ gives

$$P(U) \le \liminf_{n \to \infty} (Q_n)_*(f_n \in U).$$

This is (e) of Theorem 3.27, so the proof of Theorem 3.28 is complete.


Next is a version of Theorem 3.28 where we have two indices. 


Theorem 3.30 Let (S, d) be a metric space and $(\Omega, \mathcal{S}, Q)$ a probability space.
Suppose that for each m, n = 1, 2, ..., $f_{mn}$ is a function from $\Omega$ into S, and
$f_0$ is a measurable function from $\Omega$ into S with separable range. Then the
following are equivalent:


(i) $f_{mn} \Rightarrow f_0$, i.e. for every bounded continuous real function G on S,

$$\int^* G(f_{mn})\,dQ \to \int G(f_0)\,dQ \quad\text{as } m, n \to \infty;$$

(ii) $\beta(f_{mn}, f_0) \to 0$ as $m, n \to \infty$;

(iii) $\rho(f_{mn}, f_0) \to 0$ as $m, n \to \infty$.

Proof. For a double sequence $a_{mn}$ of real numbers, let

$$\limsup_{m,n \to \infty} a_{mn} := \inf_k \sup_{m \ge k,\, n \ge k} a_{mn}, \qquad \liminf_{m,n \to \infty} a_{mn} := \sup_k \inf_{m \ge k,\, n \ge k} a_{mn}.$$

Then the steps in the proof of Theorem 3.27 showing that (a) through (f)
are equivalent extend directly to convergence as $m \to \infty$ and $n \to \infty$, if
$\limsup_{n \to \infty}$ is replaced by $\limsup_{m,n \to \infty}$ (not by $\limsup_{m \to \infty} \limsup_{n \to \infty}$ or
the iterated lim sup in the reverse order, which may be different), and likewise
for lim inf. Thus, the proof of Theorem 3.28 also extends.
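The difference between the joint lim sup and the iterated ones can be seen on a toy double sequence. The Python sketch below uses the assumed example $a_{mn} = 1$ if m = n and 0 otherwise, with finite windows standing in for the infinite tails; the function names and the window size N are hypothetical choices.

```python
def a(m: int, n: int) -> float:
    """Illustrative double sequence: 1 on the diagonal m == n, 0 elsewhere."""
    return 1.0 if m == n else 0.0


def joint_tail_sup(k: int, N: int) -> float:
    """sup of a(m, n) over k <= m, n <= N (a finite stand-in for sup over m, n >= k)."""
    return max(a(m, n) for m in range(k, N + 1) for n in range(k, N + 1))


def row_tail_sup(m: int, k: int, N: int) -> float:
    """sup of a(m, n) over k <= n <= N for a fixed m."""
    return max(a(m, n) for n in range(k, N + 1))


N = 50
joint = [joint_tail_sup(k, N) for k in range(1, N + 1)]   # tails used by the joint lim sup
rows = [row_tail_sup(m, m + 1, N) for m in range(1, N)]   # tails used by the inner lim sup in n
```

Every joint tail contains a diagonal entry, so the joint lim sup is 1, while for each fixed m the tail in n beyond m contains only zeros, so both iterated lim sups are 0.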


We have the following:


Theorem 3.31 (Continuous mapping theorem) Under the conditions of Theorem
3.28, if (T, e) is another metric space, G is a continuous function from S
into T, and $f_m \Rightarrow f_0$, then $G(f_m) \Rightarrow G(f_0)$.


Proof. This follows directly from the definition of convergence in law $\Rightarrow$.


The next fact is straightforward: 


Proposition 3.32 Suppose $f_m$, $m \ge 0$, are measurable random variables taking
values in a separable metric space S, so that laws $\mathcal{L}(f_m)$ exist on the Borel
σ-algebra of S. Then convergence $f_m \Rightarrow f_0$ is equivalent to convergence of the
laws $\mathcal{L}(f_m) \to \mathcal{L}(f_0)$ in the usual sense.


3.7 Asymptotic Equicontinuity


Recall from Section 3.1 the definitions of the empirical measures $P_n$, empirical
process $\nu_n$, pseudometric $\rho_P$, and the Gaussian process $G_P$. Recall that a set
$\mathcal{F} \subset L^2(P)$ is called pregaussian if a $G_P$ process restricted to $\mathcal{F}$ exists whose
sample functions $f \mapsto G_P(f)(\omega)$ are (almost) all bounded and uniformly
continuous for $\rho_P$ on $\mathcal{F}$. Such a $G_P$ process is called coherent, if in addition, its
sample functions are prelinear on $\mathcal{F}$ as in Lemma 2.30. By Theorem 3.2 we
can and will assume that on a pregaussian $\mathcal{F}$, $G_P$ is coherent.


Proposition 3.33 For any probability measure P and pregaussian set
$\mathcal{F} \subset L^2(P)$, the symmetric convex hull $\operatorname{sco}(\mathcal{F})$ of $\mathcal{F}$ is also pregaussian and there
exists a $G_P$ process defined on the linear span of $\mathcal{F}$ and constant functions
which is 0 on the constant functions and coherent on $\operatorname{sco}(\mathcal{F})$.


Proof. Apply Theorems 3.1 and 2.32.


For a signed measure $\nu$ and measurable function f such that $\int f\,d\nu$ is
defined, let $\nu(f) := \int f\,d\nu$.

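A class $\mathcal{F}$ of functions will be said to satisfy the asymptotic equicontinuity
condition for P and a pseudometric $\tau$ on $\mathcal{F}$, or $\mathcal{F} \in AEC(P, \tau)$ for short, if
for every $\varepsilon > 0$ there is a $\delta > 0$ and an $n_0$ large enough such that for $n \ge n_0$,

$$\Pr{}^*\{\sup\{|\nu_n(f - g)| : f, g \in \mathcal{F},\ \tau(f, g) < \delta\} > \varepsilon\} < \varepsilon. \tag{3.14}$$

Then $\mathcal{F} \in AEC(P)$ will mean $\mathcal{F} \in AEC(P, \rho_P)$.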
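A small Monte Carlo experiment can give a feel for the quantity controlled in (3.14). The sketch below is only an illustration under stated assumptions: P is taken to be the uniform law on [0, 1], the class consists of indicators of intervals [0, t] for t on a finite grid, and $\rho_P(f, g)$ is computed as the standard deviation of f − g under P. All function names, the grid, and the parameter values are hypothetical.

```python
import bisect
import math
import random


def empirical_process(sample, grid):
    """nu_n(1_[0,t]) = sqrt(n) * (P_n([0,t]) - t) for each t in grid, P uniform on [0, 1]."""
    n = len(sample)
    xs = sorted(sample)
    return [math.sqrt(n) * (bisect.bisect_right(xs, t) / n - t) for t in grid]


def rho_P(s: float, t: float) -> float:
    """Standard deviation of 1_[0,s] - 1_[0,t] under the uniform law."""
    d = abs(s - t)
    return math.sqrt(d * (1.0 - d))


def modulus(nu, grid, delta: float) -> float:
    """sup of |nu(f) - nu(g)| over grid pairs with rho_P(f, g) < delta."""
    best = 0.0
    for i in range(len(grid)):
        for j in range(i + 1, len(grid)):
            if rho_P(grid[i], grid[j]) < delta:
                best = max(best, abs(nu[i] - nu[j]))
    return best


def exceedance_frequency(n, grid, delta, eps, reps, seed=0):
    """Fraction of simulated samples of size n whose modulus exceeds eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        if modulus(empirical_process(sample, grid), grid, delta) > eps:
            hits += 1
    return hits / reps


grid = [i / 20 for i in range(21)]
freq = exceedance_frequency(n=200, grid=grid, delta=0.2, eps=1.0, reps=100, seed=0)
```

The returned frequency is a simulation estimate for one finite grid and one sample size, so it illustrates the event in (3.14) rather than verifying the condition, which concerns all large n and the full class.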


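Theorem 3.34 Let $\mathcal{F} \subset L^2(X, \mathcal{A}, P)$. Then the following are equivalent:

(I) $\mathcal{F}$ is a Donsker class for P, in other words $\mathcal{F}$ is P-pregaussian and $\nu_n \Rightarrow G_P$
in $\ell^\infty(\mathcal{F})$;

(II) (a) $\mathcal{F}$ is totally bounded for $\rho_P$ and (b) $\mathcal{F}$ satisfies the asymptotic
equicontinuity condition for P, $\mathcal{F} \in AEC(P)$;

(III) There is a pseudometric $\tau$ on $\mathcal{F}$ such that $\mathcal{F}$ is totally bounded for $\tau$ and
$\mathcal{F} \in AEC(P, \tau)$.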


Proof. (I) implies (II): if $\mathcal{F}$ is a Donsker class and $\varepsilon > 0$, then since $\mathcal{F}$ is
pregaussian, it is totally bounded for $\rho_P$ by Theorem 3.1 and the definition of
GC-sets, so (a) holds. Take $\delta > 0$ small enough so that for any coherent $G_P$
process,

$$\Pr\{\sup\{|G_P(f) - G_P(g)| : \rho_P(f, g) < \delta\} > \varepsilon/3\} < \varepsilon/2.$$

By almost uniformly convergent realizations (Theorem 3.24) defined on some
probability space, for each $n \ge n_0$ large enough we can assume
$\Pr^*\{\|\nu_n - G_P\|_{\mathcal{F}} \ge \varepsilon/3\} < \varepsilon/2$. If $\|\nu_n - G_P\|_{\mathcal{F}} < \varepsilon/3$ and $|G_P(f) - G_P(g)| \le \varepsilon/3$,
then $|\nu_n(f) - \nu_n(g)| < \varepsilon$. So the asymptotic equicontinuity holds with the $\delta$
and $n_0$ chosen, and (b) holds. (More precisely, it holds on the original probability
space because the functions $g_n$ in Theorem 3.24 are perfect.) Thus (I) implies
(II). (II) implies (III) directly, with $\tau = \rho_P$.

To show (III) implies (I), suppose $\mathcal{F}$ is $\tau$-totally bounded and belongs to
$AEC(P, \tau)$. Let $UC := UC(\mathcal{F})$ denote the set of all real-valued functions on
$\mathcal{F}$ uniformly continuous for $\tau$. Then UC is a separable subspace of $\ell^\infty(\mathcal{F})$ for
$\|\cdot\|_{\mathcal{F}}$ since $\mathcal{F}$ is totally bounded, and we in effect have the space of continuous
functions on the compact completion (RAP, Corollary 11.2.5).

For any finite subset $\mathcal{G}$ of $\mathcal{F}$, by the finite-dimensional central limit theorem
(RAP, Theorem 9.5.6) we can let $n \to \infty$ in (3.14) and get

$$\Pr{}^*\{\sup\{|G_P(f) - G_P(g)| : f, g \in \mathcal{G},\ \tau(f, g) < \delta\} > \varepsilon\} \le \varepsilon.$$

Letting $\mathcal{G}$ increase up to a countable $\tau$-dense set $\mathcal{H} \subset \mathcal{F}$, we get by monotone
convergence

$$\Pr{}^*\{\sup\{|G_P(f) - G_P(g)| : f, g \in \mathcal{H},\ \tau(f, g) < \delta\} > \varepsilon\} \le \varepsilon.$$

We can let $\varepsilon \downarrow 0$ and $\delta = \delta(\varepsilon) \downarrow 0$ and conclude that $G_P$ is almost surely uniformly
continuous for $\tau$ on $\mathcal{H}$. For each $f \in \mathcal{F}$, considering also the countable set
$\{f\} \cup \mathcal{H}$, on the set of probability 1 where $G_P$ is $\tau$-uniformly continuous, the
limit $\lim\{G_P(h) : h \in \mathcal{H},\ \tau(h, f) \to 0\}$ exists, and equals $G_P(f)$
almost surely. It follows that we can take $G_P$ to be uniformly continuous for $\tau$
on all of $\mathcal{F}$.

Recall (Section 3.1) that $\pi_0(f) = f - \int f\,dP$ for any $f \in L^2(P)$. We have
$\nu_n(f) = \nu_n(\pi_0(f))$ and a.s. $G_P(f) = G_P(\pi_0(f))$. For the pseudometric $\tau$, if
$\tau(f, g) = 0$ for some $f, g \in \mathcal{F}$, then $\pi_0(f) = \pi_0(g)$; in other words f − g must
be a constant. It suffices to consider the family $\pi_0(\mathcal{F}) := \{\pi_0(f) : f \in \mathcal{F}\}$ in
place of $\mathcal{F}$. On $\pi_0(\mathcal{F})$, $\rho_P$ is a metric. So by Theorem 2.57, since $G_P$ is isonormal
for $(\cdot, \cdot)_{0,P}$ it follows that $\mathcal{F}$ is pregaussian. So $G_P$ has a law $\mu_3$ defined on
the Borel sets of the separable Banach space UC, in view of Theorems 3.1 and
2.32 (a) implies (h).

Given $\varepsilon > 0$, take $\delta > 0$ from (3.14) and a finite set $\mathcal{G} \subset \mathcal{F}$ such that for
each $f \in \mathcal{F}$ there is a $g \in \mathcal{G}$ such that $\tau(f, g) < \delta$. Then $\mathbb{R}^{\mathcal{G}}$ is the set of all
real-valued functions on $\mathcal{G}$. Let $\mu_2$ be the law of $G_P$ on $\mathbb{R}^{\mathcal{G}}$ and let $\mu_{23}$ be the
law on $\mathbb{R}^{\mathcal{G}} \times UC$ where $G_P$ on $\mathcal{G}$ in $\mathbb{R}^{\mathcal{G}}$ is just the restriction of $G_P$ on UC. So
$\mu_{23}$ has marginals $\mu_2$ and $\mu_3$.

Let $\mu_{1,n}$ be the law of $\nu_n$ on $\mathcal{G}$, so $\mu_{1,n}$ is also defined on $\mathbb{R}^{\mathcal{G}}$. Then by the
finite-dimensional central limit theorem again, the laws $\mu_{1,n}$ converge to $\mu_2$
on $\mathbb{R}^{\mathcal{G}}$. So for the Prohorov metric $\rho$, since it metrizes convergence of laws
(RAP, Theorem 11.3.3), $\rho(\mu_{1,n}, \mu_2) < \varepsilon$ for n large enough. Take $n \ge n_0$ also,
then fix n. By Strassen's theorem (RAP, Corollary 11.6.4), there is a law $\mu_{12}$ on
$\mathbb{R}^{\mathcal{G}} \times \mathbb{R}^{\mathcal{G}}$ with marginals $\mu_{1,n}$ and $\mu_2$ such that $\mu_{12}\{(x, y) : |x - y| > \varepsilon\} \le \varepsilon$.
By the Vorob'ev-Berkes-Philipp theorem (1.31), there is a Borel measure
$\mu_{123}$ on $\mathbb{R}^{\mathcal{G}} \times \mathbb{R}^{\mathcal{G}} \times UC$ having marginals $\mu_{12}$ and $\mu_{23}$ on the appropriate
spaces.

The next step is to link up $\nu_n$ with its restriction to the finite set $\mathcal{G}$. The
Vorob'ev-Berkes-Philipp theorem may not apply here since $\nu_n$ on $\mathcal{F}$ may not be
in a Polish space, at least not one that seems apparent. (About nonmeasurability
on nonseparable spaces see the remarks at the end of Section 1.1.) Here we can
use instead:


Lemma 3.35 Let S and T be Polish spaces and $(\Omega, \mathcal{A}, P)$ a probability space.
Let Q be a law on S × T with marginal q on S. Let V be a random variable on $\Omega$
with values in S and law $\mathcal{L}(V) = q$. Suppose there is a real random variable U
on $\Omega$ independent of V with continuous distribution function $F_U$. Then there is
a random variable $W : \Omega \to T$ such that the joint law $\mathcal{L}(V, W)$ of (V, W) is Q.


Proof. Every Polish space is Borel-isomorphic to some compact subset of
[0, 1], either the whole interval, a finite set, or a convergent sequence and its
limit (RAP, Theorem 13.1.1). Since the lemma involves only measurability and
not topological properties of the Polish spaces we can assume S = T = [0, 1].
Recall that for any real-valued random variable X with continuous distribution
function F, F(X) has a uniform distribution in [0, 1]: Proposition 1.22.

So taking $F_U(U)$, we can assume U is uniformly distributed in [0, 1].
By way of regular conditional probabilities (RAP, Section 10.2; Bauer,
1981) we can write $Q = \int Q_x\,dq(x)$ where for each x, $Q_x$ is a probability
measure on T, so that for any measurable set A in (the square) S × T,
$Q(A) = \int\!\int 1_A(x, y)\,dQ_x(y)\,dq(x)$ (RAP, Theorems 10.2.1 and 10.2.2). Let $F_x$
be the distribution function of $Q_x$ and

$$F_x^{-1}(t) := \inf\{u : F_x(u) \ge t\}, \qquad 0 < t < 1.$$

Then for any real z and 0 < t < 1, $F_x^{-1}(t) \le z$ if and only if $F_x(z) \ge t$. Now
$x \mapsto F_x(z)$ is measurable for any fixed z. It follows that $x \mapsto F_x^{-1}(t)$ is
measurable for each t, 0 < t < 1. For each x, $F_x^{-1}$ is left continuous and
nondecreasing in t. It follows that $(x, t) \mapsto F_x^{-1}(t)$ is jointly measurable. Thus
for $W(\omega) := F_{V(\omega)}^{-1}(U(\omega))$, $\omega \mapsto W(\omega)$ is measurable. For each x we have
the image measure $\lambda \circ (F_x^{-1})^{-1} = Q_x$ (RAP, Proposition 9.1.2). So for any
bounded Borel function g,

$$\begin{aligned}
\int g\,dQ &= \int_0^1\!\!\int_0^1 g(x, y)\,dQ_x(y)\,dq(x) && \text{by Theorem 10.2.1 of RAP} \\
&= \int_0^1\!\!\int_0^1 g(x, F_x^{-1}(y))\,dy\,dq(x) && \text{by the image measure theorem} \\
&= \int_0^1\!\!\int_0^1 g(x, F_x^{-1}(y))\,d(q \times \lambda)(x, y) && \text{(Tonelli-Fubini theorem)} \\
&= E(g(V, F_V^{-1}(U))) = Eg(V, W),
\end{aligned}$$



since U is independent of V and $\mathcal{L}(V) = q$ and by the image measure theorem
again. So $\mathcal{L}(V, W) = Q$, proving Lemma 3.35.
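The quantile construction in the proof of Lemma 3.35 can be sketched for finitely supported conditional laws. In the Python example below, the class `DiscreteLaw`, the two conditional laws in `kernel`, and the function `W` are hypothetical illustrations; exact rational arithmetic is used so that the equivalence between $F^{-1}(t) \le z$ and $F(z) \ge t$ can be checked without rounding issues.

```python
from fractions import Fraction
from itertools import accumulate


class DiscreteLaw:
    """A law on finitely many real atoms with exact rational probabilities."""

    def __init__(self, atoms, probs):
        pairs = sorted(zip(atoms, probs))
        self.atoms = [a for a, _ in pairs]
        self.cum = list(accumulate(Fraction(p) for _, p in pairs))
        if self.cum[-1] != 1:
            raise ValueError("probabilities must sum to 1")

    def cdf(self, z):
        """F(z) = total mass of the atoms <= z."""
        mass = Fraction(0)
        for a, c in zip(self.atoms, self.cum):
            if a <= z:
                mass = c
        return mass

    def quantile(self, t):
        """F^{-1}(t) = inf{u : F(u) >= t} for 0 < t <= 1."""
        for a, c in zip(self.atoms, self.cum):
            if c >= t:
                return a
        raise ValueError("t must lie in (0, 1]")


# Hypothetical conditional laws Q_x for x in {0, 1}.
kernel = {
    0: DiscreteLaw([0, 1, 2], [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]),
    1: DiscreteLaw([5, 7], [Fraction(1, 3), Fraction(2, 3)]),
}


def W(v, u):
    """W = F_V^{-1}(U): plug the uniform variable u into the quantile function of Q_v."""
    return kernel[v].quantile(u)
```

For a fixed v, the set of u in (0, 1] mapped to a given atom is an interval whose length equals the probability of that atom under $Q_v$, which is the image measure identity used in the proof.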


Now given $\varepsilon > 0$, let $(\Omega, \mathcal{S}, Q)$ be a probability space on which all the
empirical processes $\nu_n$ and an independent U are defined, specifically a
countable product of copies of the probability space $(X, \mathcal{A}, P)$ times one copy of
[0, 1] with Lebesgue measure $\lambda$. Then Lemma 3.35 applies to $\Omega$ with $S = \mathbb{R}^{\mathcal{G}}$
and $T = \mathbb{R}^{\mathcal{G}} \times UC$, where V is $\nu_n$ restricted to $\mathcal{G}$, and $Q = \mu_{123}$ on S × T. On
$\Omega$ we then have processes $\nu_n$ and $G_P$ defined on $\mathcal{F}$, which by construction and
the $\tau$-asymptotic equicontinuity condition (3.14) are within $3\varepsilon$ of each other
uniformly on $\mathcal{F}$ except with a probability at most $3\varepsilon$.

Let $\varepsilon \downarrow 0$ through the sequence $\varepsilon = 1/k$, k = 1, 2, .... Let the approximation
just shown hold for $n \ge n_k$ on a probability space $(\Omega_k, \mathcal{S}_k, Q_k)$. We can assume
$n_k$ is nondecreasing in k. Let $A_{kn}$ be the $\nu_n$ process defined on $\Omega_k$ and $G_{kn}$
the corresponding $G_P$ process on $\Omega_k$. Let $n_0 := 1$ and let $(\Omega_0, \mathcal{S}_0, Q_0)$ be a
probability space on which $\nu_n$ processes $A_{0n}$ and $G_P$ processes $G_{0n} := G_0$
are defined, with $G_0$ independent of the $A_{0n}$ processes. Let $A_n := A_{kn}$ and
$G_n := G_{kn}$ if and only if $n_k \le n < n_{k+1}$ for k = 0, 1, .... Then for all n, $A_n$
is a $\nu_n$ process and $G_n$ is a $G_P$ process. On the probability space
$(\Omega', Q') := (\prod_{k \ge 0} \Omega_k, \prod_{k \ge 0} Q_k)$, all $A_n$ and $G_n$ are defined and $\|A_n - G_n\|_{\mathcal{F}} \to 0$ in outer
probability, so by Theorem 3.27, $\nu_n \Rightarrow G_P$ on $\mathcal{F}$ and (III) implies (I), proving
Theorem 3.34.


3.8 Unions of Donsker Classes 


It will be shown in this section that the union of any two Donsker classes $\mathcal{F}$
and $\mathcal{G}$ is a Donsker class. This is not surprising: one might think it was enough,
given the asymptotic equicontinuity conditions for the separate classes, for a
given $\varepsilon > 0$, to take the larger of the two $n_0$'s and the smaller of the two $\delta$'s.
But it is not so easy as that. For example, $\mathcal{F}$ and $\mathcal{G}$ could both be finite sets,
with distinct elements of $\mathcal{F}$ at distance, say, more than 0.2 apart for $\rho_P$, and
likewise for $\mathcal{G}$, but there may be some element of $\mathcal{F}$ very close to an element
of $\mathcal{G}$. So the equicontinuity condition on the union will not just follow from the
conditions on the separate families.

Given a probability measure P, $\mathcal{F} \subset L^2(P)$, $\varepsilon > 0$, $\delta > 0$, and a positive
integer $n_0$, it will be said that $AE(\mathcal{F}, n_0, \varepsilon, \delta)$ holds if for all $n \ge n_0$,

$$\Pr{}^*\{\sup\{|\nu_n(f) - \nu_n(g)| : f, g \in \mathcal{F},\ \rho_P(f, g) < \delta\} > \varepsilon\} < \varepsilon.$$

Then the asymptotic equicontinuity condition, as in the previous section, holds
for $\mathcal{F}$ and P if and only if for every $\varepsilon > 0$ there is a $\delta > 0$ and an $n_0$ such that
$AE(\mathcal{F}, n_0, \varepsilon, \delta)$ holds. The asymptotic equicontinuity condition, together with
total boundedness of $\mathcal{F}$ for $\rho_P$, is equivalent to the Donsker property of $\mathcal{F}$ for
P (Theorem 3.34).


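Theorem 3.36 (K. S. Alexander) Let $(\Omega, \mathcal{A}, P)$ be a probability space and
let $\mathcal{F}_1$ and $\mathcal{F}_2$ be two Donsker classes for P. Then $\mathcal{F} := \mathcal{F}_1 \cup \mathcal{F}_2$ is also a
Donsker class for P.


Proof. (M. Arcones). Given $\varepsilon > 0$, take $\delta_i > 0$ and $n_i < \infty$ such that for
i = 1, 2, $AE(\mathcal{F}_i, n_i, \varepsilon/3, \delta_i)$ holds. $\mathcal{F}_1$ and $\mathcal{F}_2$, being Donsker classes, are
pregaussian, and $G_P$ is an isonormal process for $(\cdot, \cdot)_{0,P}$, so $\mathcal{F}$ is pregaussian
by Corollary 2.35. So there is an $\alpha > 0$ such that for a suitable version of $G_P$,

$$\Pr\{\sup\{|G_P(f) - G_P(g)| : \rho_P(f, g) < \alpha,\ f, g \in \mathcal{F}\} > \varepsilon/3\} < \varepsilon/3.$$

Let $\delta := \min(\delta_1, \delta_2, \alpha/3)$. Take finite sets $\mathcal{H}_i \subset \mathcal{F}_i$, i = 1, 2, such that for
each i and $f \in \mathcal{F}_i$ there is an $h := \tau_i f \in \mathcal{H}_i$ with $\rho_P(f, h) < \delta$. Since
$\mathcal{H} := \mathcal{H}_1 \cup \mathcal{H}_2$ is finite, by the finite-dimensional central limit theorem (RAP,
Theorem 9.5.6) $\nu_n$ restricted to $\mathcal{H}$ converges in law to $G_P$ restricted to $\mathcal{H}$. Let
$F(\mathcal{H}, \alpha, \varepsilon/3)$ be the set of all $y \in \mathbb{R}^{\mathcal{H}}$ such that $|y(f) - y(g)| \ge \varepsilon/3$ for some
$f, g \in \mathcal{H}$ with $\rho_P(f, g) < \alpha$. Then $F(\mathcal{H}, \alpha, \varepsilon/3)$ is closed and has probability
less than $\varepsilon/3$ for the law of $G_P$ on $\mathbb{R}^{\mathcal{H}}$. Thus by the portmanteau theorem (RAP,
Theorem 11.1.1), there is an m such that

$$AE(\mathcal{H}, m, \varepsilon/3, \alpha) \text{ holds.} \tag{3.15}$$

Let $n_0 := \max(n_1, n_2, m)$. It will be shown that $AE(\mathcal{F}, n_0, \varepsilon, \delta)$ holds. By the
asymptotic equicontinuity conditions in each $\mathcal{F}_i$ and since $n_0 \ge \max(n_1, n_2)$
and $\delta \le \min(\delta_1, \delta_2)$, there is a set of probability less than $\varepsilon/3$ for each i = 1, 2
such that outside these sets, $|\nu_n(f) - \nu_n(g)| \le \varepsilon/3$ for any f, g in the same $\mathcal{F}_i$
with $\rho_P(f, g) < \delta$. For pairs f, g with $\rho_P(f, g) < \delta$, with $f \in \mathcal{F}_1$ and $g \in \mathcal{F}_2$,
we have $\rho_P(\tau_1 f, \tau_2 g) < \alpha$ since $\rho_P(f, \tau_1 f) < \delta$, $\rho_P(g, \tau_2 g) < \delta$, and $3\delta \le \alpha$.
Thus by (3.15), $|\nu_n(\tau_1 f) - \nu_n(\tau_2 g)| \le \varepsilon/3$ for all such f, g, outside of another
set of probability at most $\varepsilon/3$. Thus

$$\begin{aligned}
|\nu_n(f) - \nu_n(g)| &\le |\nu_n(f) - \nu_n(\tau_1 f)| + |\nu_n(\tau_1 f) - \nu_n(\tau_2 g)| + |\nu_n(\tau_2 g) - \nu_n(g)| \\
&\le \varepsilon/3 + \varepsilon/3 + \varepsilon/3 = \varepsilon
\end{aligned}$$

for all $f \in \mathcal{F}_1$ and $g \in \mathcal{F}_2$ except on a set of probability at most
$\varepsilon/3 + \varepsilon/3 + \varepsilon/3 = \varepsilon$.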


3.9 Sequences of Sets and Functions 


This section will show how the asymptotic equicontinuity condition in Theorem
3.34 can be applied to prove that some sequences of sets and functions are
Donsker classes. In Chapters 6 and 7, other sufficient conditions for the Donsker
property will be given that will apply to uncountable families of sets and
functions. For two measurable sets A and B, let $\rho_P(A, B) := \rho_P(1_A, 1_B)$.


Theorem 3.37 Let $(X, \mathcal{A}, P)$ be a probability space and $\{C_m\}_{m \ge 1}$ a sequence
of measurable sets. If

$$\sum_{m=1}^{\infty} \bigl(P(C_m)(1 - P(C_m))\bigr)^r < \infty \quad\text{for some } r < \infty, \tag{3.16}$$

then the sequence $\{C_m\}_{m \ge 1}$ is a Donsker class for P. Conversely, if the sets
$C_m$ are independent for P, then the sequence is a Donsker class only if (3.16)
holds.
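Condition (3.16) can be explored numerically for a specific sequence of probabilities. The sketch below assumes, only for illustration, that $P(C_m) = m^{-1/2}$; for this choice the terms behave like $m^{-r/2}$, so the series converges exactly when r > 2. The function name and the truncation points are hypothetical.

```python
def partial_sum(r: float, N: int) -> float:
    """Sum over m = 1..N of (p_m (1 - p_m))**r for the illustrative choice p_m = m**-0.5."""
    total = 0.0
    for m in range(1, N + 1):
        p = m ** -0.5
        total += (p * (1.0 - p)) ** r
    return total


sums_r3 = [partial_sum(3, N) for N in (10**2, 10**3, 10**4)]   # stays bounded
sums_r1 = [partial_sum(1, N) for N in (10**2, 10**3, 10**4)]   # grows roughly like 2 * sqrt(N)
```

Partial sums cannot prove convergence, but here the r = 3 sums are dominated by the convergent series of $m^{-3/2}$, while the r = 1 sums visibly diverge; the condition only asks that some finite r works.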


Proof. Suppose (3.16) holds. Then the positive integers can be decomposed
into two subsequences, over one of which $P(C_m) \to 0$ and over the other
$P(C_m) \to 1$. It is enough to prove the Donsker property separately for each
subsequence by Theorem 3.36. For any measurable set A with complement
$A^c$, $\nu_n(A^c) = -\nu_n(A)$ and $G_P(A^c) = -G_P(A)$. The transformation of these
processes into their negatives preserves convergence in law if it holds. So we
can assume $P(C_m) \downarrow 0$ as $m \to \infty$ and then $\sum_m P(C_m)^r < \infty$. Also, $\{C_m\}_{m \ge 1}$ is
totally bounded for $\rho_P$. By Theorem 3.36 we can assume $p_m := P(C_m) \le 1/2$
for all m.

For any i and m such that $P(C_i \,\Delta\, C_m) = 0$, we will have almost surely for
any n that $P_n(C_i) = P_n(C_m)$. So we can assume that $P(C_i \,\Delta\, C_m) > 0$ for all
$i \ne m$.

For any m such that $P(C_m) = 0$ we will have $P_n(C_m) = 0$ almost surely for
any n and then $\nu_n(C_m) = 0$, so we can assume $P(C_m) > 0$ for all m.

Let $0 < \varepsilon < 1$. Suppose we can find M and N such that for all $n \ge N$

$$\Pr\Bigl\{\sup_{m \ge M} |\nu_n(C_m)| > \varepsilon\Bigr\} < \varepsilon. \tag{3.17}$$

Then for J large enough, $p_m < p_M/2$ for $m \ge J$. Let
$\gamma := \min\{P(C_i \,\Delta\, C_j) : 1 \le i < j \le J\}$, $\alpha := \min(\gamma, p_M)/2$. Then for $n \ge N$

$$\begin{aligned}
\sup\{|\nu_n(C_i) - \nu_n(C_j)| : P(C_i \,\Delta\, C_j) < \alpha\}
&\le \sup\{|\nu_n(C_i) - \nu_n(C_j)| : i, j \ge M\} \\
&\le 2\sup\{|\nu_n(C_j)| : j \ge M\} \le 2\varepsilon
\end{aligned}$$

with probability at least $1 - \varepsilon$, proving the asymptotic equicontinuity condition.
So it will be enough to prove (3.17). For that, recalling the binomial
probabilities defined in Section 1.3, it will suffice to find M and N such that for
$n \ge N$

$$\sum_{m=M}^{\infty} E(np_m + \varepsilon n^{1/2}, n, p_m) < \varepsilon/2 \tag{3.18}$$

and

$$\sum_{m=M}^{\infty} B(np_m - \varepsilon n^{1/2}, n, p_m) < \varepsilon/2. \tag{3.19}$$

Let $q_m := 1 - p_m$. For (3.19), the Chernoff inequality (1.6) gives

$$B(np_m - \varepsilon n^{1/2}, n, p_m) \le \exp(-\varepsilon^2/(2p_m q_m)).$$

For some K, $1 \le K < \infty$, $p_m \le K m^{-1/r}$ for all m. Choose M large enough
so that

$$\sum_{m=M}^{\infty} \exp(-m^{1/r}\varepsilon^2/(2K)) < \varepsilon/2$$

to give (3.19) for all n.
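The lower-tail bound used for (3.19) can be compared with exact binomial probabilities. The following Python sketch is a numerical illustration only: the helper names and the chosen values of n, p and ε are assumptions, and B(k, n, p) is taken to mean the probability of at most k successes, as the surrounding argument indicates.

```python
import math


def binom_lower_tail(k: float, n: int, p: float) -> float:
    """B(k, n, p) = P(X <= k) for X binomial(n, p); k may be any real number."""
    top = min(math.floor(k), n)
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(top + 1))


def chernoff_bound(eps: float, p: float) -> float:
    """exp(-eps^2 / (2 p q)), the bound quoted for B(np - eps sqrt(n), n, p) when p <= 1/2."""
    return math.exp(-eps**2 / (2.0 * p * (1.0 - p)))


cases = [(n, p, eps) for n in (50, 200) for p in (0.05, 0.2, 0.5) for eps in (0.1, 0.3, 0.6)]
table = [
    (n, p, eps, binom_lower_tail(n * p - eps * math.sqrt(n), n, p), chernoff_bound(eps, p))
    for n, p, eps in cases
]
```

Each row of `table` pairs an exact lower-tail probability with its bound; note that the bound does not depend on n, which is what allows M to be chosen uniformly in n.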
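The other side, (3.18), is harder. Bernstein's inequality (Theorem 1.11) gives,
if $2pn^{1/2} \ge \varepsilon$ and $0 < p \le 1/2$, with q := 1 − p, that

$$E(np + \varepsilon n^{1/2}, n, p) \le \exp(-\varepsilon^2/(2pq + \varepsilon n^{-1/2})) \le \exp(-\varepsilon^2/(6pq)).$$

Then

$$\begin{aligned}
\sum_{m=M}^{\infty} \{E(np_m + \varepsilon n^{1/2}, n, p_m) : 2p_m n^{1/2} \ge \varepsilon\}
&\le \sum_{m=M}^{\infty} \{\exp(-\varepsilon^2/(6p_m)) : 2p_m n^{1/2} \ge \varepsilon\} \\
&\le \sum_{m=M}^{\infty} \exp(-\varepsilon^2 m^{1/r}/(6K)) < \varepsilon/4
\end{aligned} \tag{3.20}$$

for all n if M is large enough.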
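The two successive Bernstein-type bounds for the upper tail can likewise be compared with exact binomial probabilities in the regime $2pn^{1/2} \ge \varepsilon$, $p \le 1/2$. This Python sketch is illustrative: the helper names and parameter values are assumptions, and E(k, n, p) is taken to mean the probability of at least k successes.

```python
import math


def binom_upper_tail(k: float, n: int, p: float) -> float:
    """E(k, n, p) = P(X >= k) for X binomial(n, p); k may be any real number."""
    start = max(math.ceil(k), 0)
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(start, n + 1))


def bernstein_bounds(eps: float, n: int, p: float):
    """Return exp(-eps^2/(2pq + eps/sqrt(n))) and the cruder exp(-eps^2/(6pq))."""
    q = 1.0 - p
    first = math.exp(-eps**2 / (2.0 * p * q + eps / math.sqrt(n)))
    second = math.exp(-eps**2 / (6.0 * p * q))
    return first, second


rows = []
for n in (50, 200, 800):
    for p in (0.05, 0.2, 0.5):
        for eps in (0.1, 0.3, 0.6):
            if 2.0 * p * math.sqrt(n) >= eps:   # the regime in which the bound is applied
                exact = binom_upper_tail(n * p + eps * math.sqrt(n), n, p)
                rows.append((n, p, eps, exact) + bernstein_bounds(eps, n, p))
```

The second bound follows from the first because $\varepsilon n^{-1/2} \le 2p \le 4pq$ when $q \ge 1/2$, so the denominator $2pq + \varepsilon n^{-1/2}$ is at most $6pq$.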
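It remains to treat the sum, which will be called $S_2$, of (3.18) restricted to
values of m with

$$2p_m n^{1/2} < \varepsilon. \tag{3.21}$$

For $p := p_m$, inequality (1.7) implies

$$E(np + \varepsilon n^{1/2}, n, p) \le \bigl(np/(np + \varepsilon n^{1/2})\bigr)^{np + \varepsilon n^{1/2}} e^{\varepsilon n^{1/2}}.$$

Let $y := y(n, m, \varepsilon) := n^{1/2} p_m/\varepsilon$, so y > 0. Let $f(x) := (1 + x)\log(1 + x^{-1})$.
Then

$$\varepsilon n^{1/2} - (np + \varepsilon n^{1/2}) \log(1 + \varepsilon/(pn^{1/2})) = \varepsilon n^{1/2}[1 - f(y)].$$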



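For x > 0, $f'(x) < 0$. By (3.21), $y < 1/2$, so $f(y) > f(1/2) > 3/2$. Thus
$1 < 2f(y)/3$, $\varepsilon n^{1/2}[1 - f(y)] < -\varepsilon n^{1/2} f(y)/3$, and so

$$\begin{aligned}
S_2 &\le \sum_{m=1}^{\infty} \{\exp(-(\varepsilon n^{1/2} + np_m)[\log(1 + \varepsilon/(n^{1/2} p_m))]/3) : 2p_m n^{1/2} < \varepsilon\} \\
&\le \sum_{m=1}^{\infty} \{(p_m n^{1/2}/\varepsilon)^{\varepsilon n^{1/2}/3} : 2p_m n^{1/2} < \varepsilon\}
\end{aligned}$$

since $1/y \le (1 + 1/y)^{1+y}$, so $(1 + 1/y)^{-(1+y)} \le y$. Now since $p_m \le K m^{-1/r}$ we
have $S_2 \le S_3 + S_4$ where

$$\begin{aligned}
S_3 &:= \sum_{m=1}^{\infty} \{(p_m n^{1/2}/\varepsilon)^{\varepsilon n^{1/2}/3} : 2K m^{-1/r} n^{1/2} < \varepsilon\} \\
&\le (Kn^{1/2}/\varepsilon)^{\varepsilon n^{1/2}/3} \sum_{m \ge G} m^{-\varepsilon n^{1/2}/(3r)}
\end{aligned}$$

where $G := (2Kn^{1/2}/\varepsilon)^r \ge 2$. Then for any $n_1 > (3r/\varepsilon)^2$ and $n \ge n_1$,

$$\begin{aligned}
S_3 &\le (Kn^{1/2}/\varepsilon)^{\varepsilon n^{1/2}/3} \int_{G-1}^{\infty} x^{-\varepsilon n^{1/2}/(3r)}\,dx \\
&\le (Kn^{1/2}/\varepsilon)^{\varepsilon n^{1/2}/3} \bigl[\varepsilon n^{1/2}/(3r) - 1\bigr]^{-1} (G - 1)^{1 - \varepsilon n^{1/2}/(3r)}.
\end{aligned}$$

For K, $\varepsilon$ and r fixed and $n \to \infty$, the logarithm of the latter expression is
asymptotic to $-\varepsilon n^{1/2}(\log 2)/3 \to -\infty$, so $S_3 \to 0$. Thus $S_3 < \varepsilon/8$ for $n \ge n_2$
for some $n_2$. Finally,

$$\begin{aligned}
S_4 &:= \sum_{m=1}^{\infty} \{(p_m n^{1/2}/\varepsilon)^{\varepsilon n^{1/2}/3} : 2p_m n^{1/2} < \varepsilon \le 2K m^{-1/r} n^{1/2}\} \\
&\le (2Kn^{1/2}/\varepsilon)^r\, 2^{-\varepsilon n^{1/2}/3} \to 0
\end{aligned}$$

as $n \to \infty$, so $S_4 < \varepsilon/8$ for $n \ge n_3$ for some $n_3$. Thus for
$n \ge \max(n_1, n_2, n_3)$, $S_2 < \varepsilon/4$. This and (3.20) give (3.18). So (3.16) implies
that $\{C_m\}_{m \ge 1}$ is a Donsker class.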

For the converse, if the measurable sets $\{C_m\}_{m \ge 1}$ are independent for $P$
and form a Donsker class for $P$, it will be shown that (3.16) holds. Note that
for each $n$, $P_n(C_m)$ are independent random variables for $m = 1, 2, \dots$. Let
$A := \{m : p_m \le 1/2\}$. Suppose that for all $n$, $\sum_{m \in A} p_m^n = +\infty$. Then $A$ is
infinite. For each $n$, $\Pr\{P_n(C_m) = 1 \text{ for infinitely many } m\} = 1$ by the
Borel–Cantelli lemma. Now $P_n(C_m) = 1$ implies $\nu_n(C_m) = n^{1/2}(1 - p_m) \ge n^{1/2}/2$
for $m \in A$. Similarly, if for all $n$, $\sum_{m \notin A} (1 - p_m)^n = +\infty$, then

$$\Pr(P_n(C_m) = 0 \text{ for infinitely many } m \notin A) = 1,$$

and $P_n(C_m) = 0$, $m \notin A$, implies $\sqrt{n}(P_n - P)(C_m) < -\sqrt{n}/2$.

Let $\mathcal{F} := \{1_{C_m}\}_{m \ge 1}$. As $\mathcal{F}$ is countable, $\|\nu_n\|_{\mathcal{F}}$ and $\|G_P\|_{\mathcal{F}}$ are
measurable random variables. It has been shown that if either $\sum_{m \in A} p_m^n = +\infty$
or $\sum_{m \notin A} (1 - p_m)^n = +\infty$ for all $n$, then $\|\nu_n\|_{\mathcal{F}} \ge \sqrt{n}/2$ almost surely for
all $n$. As $\mathcal{F}$ is Donsker and so pregaussian, there is an $M < \infty$ such
that $\Pr(\|G_P\|_{\mathcal{F}} > M/2) < 1/4$. For $\phi \in \ell^\infty(\mathcal{F})$ let $H_M(\phi) := \min(M, \|\phi\|_{\mathcal{F}})$.
Then $H_M$ is a bounded Lipschitz, thus continuous function on $\ell^\infty(\mathcal{F})$. We
have $E H_M(G_P) \le 3M/4$. For $n \ge 4M^2$, $E H_M(\nu_n) \ge M$, a contradiction since
$\nu_n \Rightarrow G_P$ by definition of Donsker class. It follows that for some $n$ large
enough,

$$\sum_{m=1}^{\infty} ((1 - p_m)p_m)^n \le \sum_{m \in A} p_m^n + \sum_{m \notin A} (1 - p_m)^n < +\infty,$$

which gives (3.16).
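
The first inequality in the last display rests on a termwise comparison, which can be written out as follows (a worked step added for the reader, using only $0 \le p_m \le 1$ and the splitting of the index set by $A$):

$$((1 - p_m)p_m)^n \le
\begin{cases}
p_m^n, & m \in A \quad (\text{since } 0 \le 1 - p_m \le 1),\\
(1 - p_m)^n, & m \notin A \quad (\text{since } 0 \le p_m \le 1).
\end{cases}$$

Summing the first bound over $m \in A$ and the second over $m \notin A$ gives the stated estimate.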


Next, we consider sequences of functions. For a probability space $(A, \mathcal{A}, P)$
and $f \in \mathcal{L}^2(A, \mathcal{A}, P)$ let $\sigma_P^2(f) := \int f^2\,dP - (\int f\,dP)^2$ (the variance of $f$).
Here is a sufficient condition for the Donsker property of a sequence $\{f_m\}$
which is easy to prove, yet turns out to be optimal of its kind:


Theorem 3.38 If $\{f_m\}_{m \ge 1} \subset \mathcal{L}^2(P)$ and $\sum_{m=1}^{\infty} \sigma_P^2(f_m) < \infty$, then $\{f_m\}_{m \ge 1}$ is
a Donsker class for $P$.


Proof. We can assume that $\sigma_P(f_m) > 0$ for all $m$ since the set of $f_m$ with 0
variance is clearly $P$-Donsker, and we can apply Theorem 3.36. Since $\nu_n$ and
$G_P$ are the same on $f_m - Ef_m$ as on $f_m$, a.s. for all $m$, we can assume $Ef_m = 0$
for all $m$. Then $f_m \to 0$ in $\mathcal{L}^2$, so the sequence $\{f_m\}$ is totally bounded for $\rho_P$.
For any $0 < \varepsilon < 1$, $n \ge 1$ and $m \ge 1$, by Chebyshev's inequality

$$\sum_{j \ge m} \Pr(|\nu_n(f_j)| \ge \varepsilon/2) \le 4\sum_{j \ge m} \sigma_P^2(f_j)/\varepsilon^2 < \varepsilon$$

for $m \ge m_0$ for some $m_0 < \infty$. We have a.s. for all $n$, $\nu_n(f_j) = \nu_n(f_k)$ for all $j$
and $k$ with $\sigma_P(f_j - f_k) = 0$. Let

$$\alpha := \inf\{\sigma_P(f_j - f_k) : \sigma_P(f_j - f_k) > 0,\ j < m_0\}.$$

Then $\alpha > 0$ since in some $\rho_P$ neighborhood of $f_j$ there are only finitely many
$f_k$. Let $\delta := \min(\alpha, 1)$. Then

$$\Pr\{|\nu_n(f_j) - \nu_n(f_k)| > \varepsilon \text{ for some } j, k \text{ with } \sigma_P(f_j - f_k) < \delta\} < \varepsilon,$$

implying the asymptotic equicontinuity condition and so finishing the proof by
Theorem 3.34.
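
The Chebyshev step in this proof can be spelled out. The following is a worked step under the proof's normalization $Ef_j = 0$; the notation $X_1, \dots, X_n$ for the i.i.d. observations with law $P$ is an assumption made here for illustration. Since $\nu_n(f_j) = n^{-1/2}\sum_{i=1}^{n} f_j(X_i)$ has mean 0 and variance $\sigma_P^2(f_j)$ for every $n$,

$$\Pr(|\nu_n(f_j)| \ge \varepsilon/2) \le \frac{\sigma_P^2(f_j)}{(\varepsilon/2)^2} = \frac{4\sigma_P^2(f_j)}{\varepsilon^2}.$$

The bound does not depend on $n$, and because $\sum_j \sigma_P^2(f_j)$ converges, its tail $\sum_{j \ge m} \sigma_P^2(f_j)$ is below $\varepsilon^3/4$ once $m$ is large enough, which gives the final "$< \varepsilon$".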


The following shows that Theorem 3.38, although it does not imply the first 
half of Theorem 3.37, is sharp in one sense: 
Proposition 3.39 Let $A := [0,1]$ and $P := U[0,1] :=$ Lebesgue measure
on $A$. Let $a_m > 0$ satisfy $\sum_{m=1}^{\infty} a_m = +\infty$. Then there is a sequence
$\{f_m\} \subset \mathcal{L}^2(A, \mathcal{A}, P)$ with $\sigma_P^2(f_m) \le a_m$ for all $m$ where $\{f_m\}$ is not a Donsker
class.


Proof. We can assume $a_m \to 0$. There exist $c_m \downarrow 0$ such that $\sum_m a_m c_m = +\infty$ (see
Problem 12). In $A$ let $C_m$ be independent sets with $P(C_m) = a_m c_m$ for each $m$
(see Problem 13). Let $f_m := c_m^{-1/2} 1_{C_m}$. Then $\sigma_P^2(f_m) \le \int f_m^2\,dP = a_m$. For each
$n$, almost surely $P_n(C_m) \ge 1/n$ for infinitely many $m$. Then $\sup_m \nu_n(f_m) =
+\infty$ a.s., so the asymptotic equicontinuity condition fails and $\{f_m\}$ is not a
Donsker class for $P$.


3.10 Closure of Donsker Classes under Sequential Limits 


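If $(X, \mathcal{A})$ is a measurable space and $\mathcal{F} \subset \mathcal{L}^0(X, \mathcal{A})$, a function $F \in \mathcal{L}^0(X, \mathcal{A})$
will be called an envelope function for $\mathcal{F}$ iff for all $x \in X$ and $f \in \mathcal{F}$, $|f(x)| \le
F(x)$. We have the following stability of the Donsker property under pointwise
sequential limits:


Theorem 3.40 Let $(X, \mathcal{A}, P)$ be a probability space, and $\mathcal{F} \subset \mathcal{L}^2(X, \mathcal{A}, P)$ a
Donsker class for $P$. Suppose $\mathcal{F}$ has an envelope $F \in \mathcal{L}^2(X, \mathcal{A}, P)$. Let $\mathcal{G}$ be
the class of all functions $g : X \to \mathbb{R}$ such that for some $g_m \in \mathcal{F}$, $g_m(x) \to g(x)$
for all $x \in X$. Then $\mathcal{G}$ is also $P$-Donsker.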


Proof. Clearly, $\mathcal{G} \subset \mathcal{L}^0(X, \mathcal{A}, P)$. By dominated convergence using $F \in \mathcal{L}^2$,
we have $\mathcal{G} \subset \mathcal{L}^2(X, \mathcal{A}, P)$. By Theorem 3.34, (a) implies (b), we have the
asymptotic equicontinuity condition for $\mathcal{F}$ and $\rho_P$. Let it hold for a given $\varepsilon > 0$,
$\delta > 0$, and $n_0$, and take any $n \ge n_0$. Let $f, g \in \mathcal{G}$ satisfy $\rho_P(f, g) < \delta$. Take $f_m$
and $g_m$ in $\mathcal{F}$ with $f_m(x) \to f(x)$ and $g_m(x) \to g(x)$ for all $x$. Then $f_m \to f$ and
$g_m \to g$ in $\mathcal{L}^2(X, \mathcal{A}, P)$. It follows that $\rho_P(f_m, g_m) < \delta$ for $m$ large enough.
Thus the supremum in the definition of the asymptotic equicontinuity condition
for the given $\delta$ is actually the same for $\mathcal{G}$ as it is for $\mathcal{F}$, so the condition holds
for $\mathcal{G}$ with the same $\delta$ and $n_0$ for the given $\varepsilon$. Likewise, $\mathcal{G}$ is totally bounded for
$\rho_P$. By Theorem 3.34, (b) implies (a), $\mathcal{G}$ is Donsker for $P$.


3.11 Convex Hulls of Donsker Classes 


Let $\mathcal{F}$ be a class of real-valued functions on a set $X$. Recall that the symmetric
convex hull of $\mathcal{F}$ is the set of all functions $\sum_{i=1}^{m} c_i f_i$ for $f_i \in \mathcal{F}$, $c_i \in \mathbb{R}$, any
finite $m$, and $\sum_{i=1}^{m} |c_i| \le 1$. If $0 < M < \infty$, let $H(\mathcal{F}, M)$ denote $M$ times the
symmetric convex hull of $\mathcal{F}$.
Let $H_s(\mathcal{F}, M)$ be the smallest class $\mathcal{G}$ of functions including $H(\mathcal{F}, M)$ such
that whenever $g_n \in \mathcal{G}$ for all $n$ and $g_n(x) \to g(x)$ as $n \to \infty$ for all $x$, we have
$g \in \mathcal{G}$.

We have the following:


Theorem 3.41 Let $(X, \mathcal{A}, P)$ be a probability space and $\mathcal{F} \subset \mathcal{L}^2(P)$ a class of
functions. Suppose $\mathcal{F}$ has an envelope function $F$ in $\mathcal{L}^2(P)$. If $\mathcal{F}$ is a Donsker
class for $P$, then so is $H_s(\mathcal{F}, M)$ for any $M$.


Proof. The process $G_P$ can and will be chosen to have a distribution $\mu$ concentrated
on a separable subspace of $\ell^\infty(\mathcal{F})$. By Theorem 3.27, (a) implies (g), there
is some probability space $(\Omega, \mathcal{S}, Q)$ and measurable functions $g_n$ from $\Omega$ into
$X^n$ and $h_n$ from $\Omega$ into $\ell^\infty(\mathcal{F})$ such that the $g_n$ are perfect, $Q \circ g_n^{-1} = P^n$ and
$Q \circ h_n^{-1} = \mu$ for all $n$, and $\|\nu_n \circ g_n - h_n\|_{\mathcal{F}} \to 0$ almost uniformly, where $\nu_n$
is the function $\nu_n(x_1, \dots, x_n) = \sqrt{n}(P_n - P)$ with $P_n = \frac{1}{n}\sum_{j=1}^{n} \delta_{x_j}$ from $X^n$
into $\ell^\infty(\mathcal{F})$. Let $\nu_n' := \nu_n \circ g_n$. Let $G_P^{(n)} := h_n$ on $\Omega$, which is a version of $G_P$
for each $n$ since it has distribution $\mu$. So we have

$$\|\nu_n' - G_P^{(n)}\|_{\mathcal{F}} \to 0 \tag{3.22}$$

almost uniformly. $H(\mathcal{F}, 1)$ is pregaussian by Proposition 3.33, so each $G_P^{(n)}$
extends, using prelinearity (Theorem 2.32), to be uniformly continuous almost
surely on $H(\mathcal{F}, 1)$. It is easily seen that in (3.22), the norm over $\mathcal{F}$ can be
replaced by the norm over $H(\mathcal{F}, 1)$ without increasing it. We can apply Theorem
3.27, (g) implies (a), since the perfect functions $g_n$ are unchanged. So $H(\mathcal{F}, 1)$ is
a Donsker class. We can multiply by $M$, multiplying the norms all by $M$ so that
we still have almost uniform convergence. We can take sequential pointwise
limits by Theorem 3.40. This finishes the proof.


Problems 


1. Let $(\Omega, \mathcal{A}, P)$ be a probability space, $(S, d)$ a (possibly nonseparable) metric
space and let $x_n$, $n = 0, 1, 2, \dots$, be points of $S$. Let $f_n(\omega) = x_n$ for all $\omega$.
Show that $f_n \Rightarrow f_0$ if and only if $x_n \to x_0$. Hint: Define $H(x) = d(x, x_0)$. Show
that $H$ is continuous, in fact, $|H(x) - H(y)| \le d(x, y)$, so $H$ is Lipschitz.
Let $G(x) = \min(H(x), 1)$. Then show that $G$ is bounded and continuous, with
$G(x_n) \to 0$ if and only if $x_n \to x_0$. Use $G$ for one direction.


2. In a general (possibly nonseparable) metric space show that if $X_0 \equiv p \in S$
is a constant random variable then random elements $X_n \Rightarrow X_0$ if and only if
$X_n \to X_0$ in outer probability. Hint: Use a function $G$ as in the last problem
with $p$ in place of $x_0$.

3. Let $(T, d)$ be any metric space and $(\Omega, \mathcal{A}, Q)$ a probability space. Let $f_n$
for $n \ge 0$ be functions from $\Omega$ into $T$ such that $f_n \Rightarrow f_0$. Let $g_n$ for $n \ge 1$ be
functions from $\Omega$ into $T$ such that $d(g_n, f_n) \to 0$ in outer probability. Show
that $g_n \Rightarrow f_0$. Hint: Apply Theorem 3.27 (a) $\Leftrightarrow$ (b'), using (a) in the hypothesis
and (b') in the conclusion.


4. Let $T$ be the set of all bounded real-valued functions on $[0, 1]$ and $C[0, 1]$
the space of all continuous functions on $[0, 1]$. Let the norm on $T$ and its
subspace $C[0, 1]$ be $\|f\|_{\sup} := \sup_{0 \le t \le 1} |f(t)|$. Suppose on some probability
space $(\Omega, \mathcal{A}, P)$, Brownian bridge processes $Y_n$ are defined for $n = 0, 1, \dots$,
having an arbitrary joint distribution, but such that for each $n$ and $\omega$, $t \mapsto Y_n(t, \omega)$ is
continuous as a function of $t \in [0, 1]$.


(a) Show that $Y_n \Rightarrow Y_0$. Hint: As all $Y_n$ take values in the separable Banach space
$C[0, 1]$ and have the same distribution, it should not be hard to show that they
converge in distribution (law).

(b) Suppose that on $(\Omega, \mathcal{A}, P)$ we also have empirical processes $\alpha_n(t)$, $0 \le
t \le 1$, for the $U[0, 1]$ distribution, defined with their usual joint distribution.
Show that $\mathcal{C} := \{[0, t] : 0 \le t \le 1\}$ is a Donsker class for $P$; i.e., show that
$\alpha_n \Rightarrow Y_0$. Hint: Never mind what $Y_n$ might have been in part (a). Instead choose
$Y_n$ based on the Komlós–Major–Tusnády (–Bretagnolle–Massart) Theorem 1.8
and use the result of Problem 3.

(c) Show that for any probability measure $P$ on the Borel sets of $S = \mathbb{R}$,
$\mathcal{C} = \{(-\infty, t] : t \in \mathbb{R}\}$ is a Donsker class for $P$. Hint: Let $F$ be the distribution
function of $P$. One needs to compose the processes in part (b) with $F$. (This
gives essentially "Donsker's theorem" as M. D. Donsker stated it in 1954.)


5. Let $A_k$ be independent sets in a probability space $(\Omega, \mathcal{A}, P)$ such that

$$\sum_k [P(A_k)(1 - P(A_k))]^n = +\infty$$

for all $n = 1, 2, \dots$. Show that $\{A_k\}_{k \ge 1}$ is not a Glivenko–Cantelli class, i.e.
$\sup_k |(P_n - P)(A_k)|$ does not converge to 0 in probability as $n \to \infty$. Hint: See
the last part of the proof of Theorem 3.37.


6. Let $(A, \mathcal{A}, P)$ be a probability space and let $\mathcal{F} \subset \mathcal{L}^2(A, \mathcal{A}, P)$ be a Donsker
class for $P$.


(a) Show that the convex hull of $\mathcal{F}$, namely

$$\mathrm{co}(\mathcal{F}) := \Bigl\{\sum_{j=1}^{k} \lambda_j f_j : f_j \in \mathcal{F},\ \lambda_j \ge 0,\ \sum_{j=1}^{k} \lambda_j = 1,\ k = 1, 2, \dots\Bigr\},$$

is a Donsker class. Hint: Use Theorem 2.32 to get that $\mathrm{co}(\mathcal{F})$ is pregaussian and
$G_P$ can be taken to be prelinear on $\mathrm{co}(\mathcal{F})$. Then use almost surely convergent
realizations (Theorem 3.24).

(b) For any fixed $k < \infty$, show that $\{\sum_{j=1}^{k} f_j : f_j \in \mathcal{F} \text{ for } j = 1, \dots, k\}$ is
also a Donsker class. Hints: Use induction and Theorem 3.36 on unions. Thus
take $k = 2$. It is easy to show that $2\mathcal{F} := \{2f : f \in \mathcal{F}\}$ is Donsker. Then
apply part (a).


7. Let $c > 0$. For Lebesgue measure $\lambda$ on $[0, 1]$, the Poisson process with
intensity measure $c\lambda$ is defined by first choosing $n$ at random having a Poisson
distribution with parameter $c$, so that $P(n = k) = e^{-c}c^k/k!$ for $k = 0, 1, \dots$,
then setting $Y_c := \sum_{j=1}^{n} \delta_{X_j}$, where $X_j$ are i.i.d. with law $\lambda$ on $[0, 1]$. In the
Banach space of bounded functions on $[0, 1]$ with supremum norm, prove that as
$c \to \infty$ (along any sequence) the random functions $t \mapsto (Y_c - c\lambda)c^{-1/2}([0, t])$,
$0 \le t \le 1$, converge in law to the Brownian motion process $x_t$, $0 \le t \le 1$.
(Recall that $x_t = L(1_{[0,t]})$.) Hints: For $c = c_k \to \infty$ let $n = n_k$ be Poisson $(c_k)$.
For $F(t) = t$, $0 \le t \le 1$, write $Y_c(t) := Y_c([0, t]) = nF_n(t)$, so

$$c^{-1/2}(Y_c(t) - ct) = (n/c)^{1/2}[n^{1/2}(F_n - F)(t)] + c^{-1/2}(n - c)t.$$

By Donsker's theorem take Brownian bridges $y^{(n)}$ such that $n^{1/2}(F_n - F)$ is
close to $y^{(n)}$ for $n$ large. Also, $c^{-1/2}(n - c)$ is close in distribution by the central
limit theorem to a random variable $Z_c$ with distribution $N(0, 1)$. Show one can
take $n(\omega)$ independent of $X_1, X_2, \dots$, then $y_t(\omega) := y_t^{(n(\omega))}(\omega)$ is a Brownian
bridge. Show one can get $Z_c$ to be independent of $\{y_t\}_{0 \le t \le 1}$. Then $y_t + Z_c t$ has
the distribution of Brownian motion on $[0, 1]$. Apply the method of Problem 3
as in Problem 4.


8. Let $P$ be a law on a separable, infinite-dimensional Hilbert space $H$ such that
$\int \|x\|^2\,dP(x) < \infty$ and with mean 0, so that $\int (x, h)\,dP(x) = 0$ for all $h \in H$.
Let $X_1, X_2, \dots$ be i.i.d. in $H$ with law $P$ and $S_n := X_1 + \cdots + X_n$.


(a) Show that the central limit theorem holds in $H$, i.e. $S_n/n^{1/2}$ converges in
law to some normal measure on $H$. Hint: Prove using variances that the laws
of $S_n/n^{1/2}$ are tight.

(b) Show that the class of functions $x \mapsto (x, h)$ for $h \in H$ with $\|h\| \le 1$ (the
unit ball of the dual space of $H$) is a Donsker class of functions on $H$ for $P$.


Hints: Part (a): Let $\{e_n\}$ be an orthonormal basis so that for any $x \in H$, $x =
\sum_n x_n e_n$ with $\|x\|^2 = \sum_n x_n^2$. Thus $\sum_n E x_n^2$ is finite. Show that for some
$c_n \to 0$ slowly enough, $\sum_n E x_n^2/c_n^2$ is still finite. For any $K$ let $C_K$ be the set of $x$
such that $\sum_n (x_n/c_n)^2 \le K^2$ (an infinite-dimensional ellipsoid). Show that each
$C_K$ is compact and that the laws of $S_n/n^{1/2}$ are uniformly tight using the sets
$C_K$. Thus subsequences of these laws converge. Show that they all converge to
the same Gaussian limit law, since by the Stone–Weierstrass theorem, functions
depending on only finitely many $x_j$ are dense in the continuous functions on
each $C_K$.


9. The two-sample empirical process. Let $X_1, \dots, X_m, Y_1, \dots, Y_n$ be i.i.d. with
the uniform distribution $U[0, 1]$ on $[0, 1]$. Let $F_m$ be the empirical distribution
function based on $X_1, \dots, X_m$ and $G_n$ likewise based on $Y_1, \dots, Y_n$. Show that
in the space of all bounded functions on $[0, 1]$ with supremum norm,

$$\Bigl(\frac{mn}{m+n}\Bigr)^{1/2}(F_m - G_n) \Rightarrow y$$

as $m, n \to \infty$ where $t \mapsto y_t$, $0 \le t \le 1$, is a Brownian bridge process. Hints:
Extend the definition of convergence in law to two indices $m, n$ both going
to $+\infty$ and otherwise unrestricted. For $m, n$ large, $m^{1/2}(F_m - F)$ is close to a
Brownian bridge $Y(m)$ and $n^{1/2}(G_n - F)$ is close to an independent Brownian
bridge $Z(n)$ by the Komlós–Major–Tusnády–Bretagnolle–Massart theorem. It
follows as in Problems 3 and 4 that $[(mn)/(m+n)]^{1/2}(F_m - G_n)$ is close to
$(n/(m+n))^{1/2}Y(m) - (m/(m+n))^{1/2}Z(n)$, which is a Brownian bridge.


10. Let $f_n(x) = \cos(2\pi nx)$ for $0 \le x \le 1$ with law $U[0, 1]$ on $[0, 1]$. For real $c$
let $\mathcal{F}_c$ be the sequence of functions $n^{-c} f_n$ for all $n = 1, 2, \dots$. For what values
of $c$ is $\mathcal{F}_c$ pregaussian? A Donsker class? Hints: If $c \le 0$, show easily that the
class is not pregaussian and so not Donsker. If $c > 0$ then it is pregaussian, by
metric entropy. If $c > 1/2$, it is Donsker by Theorem 3.38. Show it is Donsker
for $0 < c \le 1/2$ by the Bernstein inequality and the asymptotic equicontinuity
condition.


11. In $\mathbb{R}$, for any law $P$ on $\mathbb{R}$, show that for any fixed $k < \infty$, $\mathcal{C} :=
\{\bigcup_{j=1}^{k} (a_j, b_j] : a_j < b_j \text{ for all } j\}$ is a Donsker class, i.e. $\mathcal{F} := \{1_C : C \in \mathcal{C}\}$
is a Donsker class. Hint: Apply Donsker's theorem as in Problem 4 and take
differences (and sums). For $k = 2$ reduce to the case of disjoint intervals and
apply Problem 6. Do induction to get general $k$.


12. Show that as stated in the proof of Proposition 3.39, for any $a_m > 0$ with
$\sum_m a_m = +\infty$ there are $c_m \downarrow 0$ with $\sum_m c_m a_m = +\infty$. Hint: Take a sequence
$m_k$ such that $\sum\{a_m : m_k \le m < m_{k+1}\} > k$ for each $k = 1, 2, \dots$. Let $c_m$ have
the same value for $m_k \le m < m_{k+1}$.


13. Show that as also stated in the proof of Proposition 3.39, in [0, 1] for 
P = U[0, 1] there exist independent sets Cm with any given probabilities. 
Hint: Use binary expansions. By decomposing the set of positive integers into 
a countable union of countably infinite sets, show that ([0, 1], P) is isomorphic 
as a probability space to a countable Cartesian product of copies of itself. 


14. Suppose that $\{C_m\}_{m \ge 1}$ are independent for $P$, $\sum_{m=1}^{\infty} P(C_m)(1 - P(C_m)) =
+\infty$ and $c_m \to \infty$. Show that $\{c_m 1_{C_m}\}_{m \ge 1}$ is not a Donsker class.


15. Do $f_n$ in the proof of Proposition 3.21 converge in outer probability: to
$f_0$, to something else, or not to any function? Based on your answer, explain
whether there is a contradiction with Corollary 3.17, and if not, why not.


Notes 


Notes to Section 3.1. For any metric space $(S, d)$, let $\mathcal{B}_b(S, d)$ be the σ-algebra
generated by all balls $B(x, r) := \{y : d(x, y) < r\}$, $x \in S$, $r > 0$.
Then $\mathcal{B}_b(S, d)$ is always included in the Borel σ-algebra $\mathcal{B}(S, d)$ generated by
all the open sets, with $\mathcal{B}_b(S, d) = \mathcal{B}(S, d)$ if $(S, d)$ is separable.

Suppose $Y_n$ are functions from a probability space $(\Omega, P)$ into $S$, measurable
for $\mathcal{B}_b(S, d)$. Then each $Y_n$ has a law $\mu_n = P \circ Y_n^{-1}$ on $\mathcal{B}_b(S, d)$.

Dudley (1966, 1967b) defined convergence in law of $Y_n$ to $Y_0$ to mean that
$\int^* H\,d\mu_n \to \int H\,d\mu_0$ for every bounded continuous real-valued function $H$
on $S$. Hoffmann-Jørgensen (1984) gave the newer definition adopted generally
and here, where the upper integrals and integral are taken over $\Omega$, not $S$, so
that the laws $\mu_n$ are not necessarily defined on any particular σ-algebra in $S$.
Hoffmann-Jørgensen's monograph was published in 1991, apparently without
major revision (its latest reference is from 1981).

Andersen (1985a,b), Andersen and Dobrić (1987, 1988), and Dudley (1985a)
developed further the theory based on Hoffmann-Jørgensen's definition.


Notes to Section 3.2. Blumberg (1935) defined the measurable cover function
$f^*$, see also Goffman and Zink (1960). (I thank the late Rae M. Shortt for
pointing out Blumberg's paper.) Later, Eames and May (1967) also defined
$f^*$. Lemmas 3.4 through 3.8 are more or less as in Dudley and Philipp (1983,
Section 2), except that Lemma 3.6(c) is new here (in the first, 1999 edition).
Theorem 3.3 and its proof are as in Vulikh (1961/1967, pp. 78-79) for Lebesgue 
measure on an interval (the proof needs no change). Luxemburg and Zaanen 
(1983, Lemma 94.4 p. 222) also prove existence of essential suprema and 
infima of families of measurable (extended) real functions. 


Notes to Section 3.3. This section was based on parts of Dudley (1985a). 


Notes to Section 3.4. Perfect probability spaces were apparently first defined by 
Gnedenko and Kolmogorov (1949), Section 3, and their theory was carried on 
among others by Ryll-Nardzewski (1953), Sazonov (1962), and Pachl (1979). 

Perfect functions are defined and treated in Hoffmann-Jørgensen
(1984, 1985) and Andersen (1985a,b); see also Dudley (1985a).


Notes to Section 3.5. The existence of almost surely convergent random variables
with a given converging sequence of laws was first proved by Skorohod
(1956) for complete separable metric spaces, then Dudley (1968) for any sepa- 
rable metric space, with a re-exposition in RAP, Section 11.7, and by Wichura 
(1970) for laws on the o-algebra generated by balls in an arbitrary metric space 
as mentioned in the notes to Section 3.1. The current version was given in 
Dudley (1985a). 


Notes to Section 3.6. Hoffmann-Jørgensen (1984), who defined convergence
in law in the sense adopted in this chapter, also developed the theory of it as in 
this section, and partly in a more general form (with nets instead of sequences, 
and other classes of functions in place of the bounded Lipschitz functions). 

Andersen and Dobrić (1987, Remark 2.13) pointed out that the portmanteau 
theorem (as in Topsøe, 1970, Theorem 8.1) “can be extended to the nonmeasur- 
able case. The proof of this extension is the same as the ordinary proof.” Much 
the same might be said of other equivalences in this section. Dudley (1990, 
Theorem A) gave a form of the portmanteau theorem and (Theorem B) of the 
metrization theorem 3.28. 

But not all facts or proofs from the separable case extend so easily: for 
example, in the separable case, there is an inequality for the two metrics, 
$\beta \le 2\rho$, in the opposite direction to Lemma 3.29, which follows from Strassen's
theorem on nearby variables with nearby laws (RAP, Corollary 11.6.5), but 
Strassen’s theorem seems not to extend well to the nonmeasurable case (Dudley 
1994). 


Notes to Section 3.7. An early form of the asymptotic equicontinuity condition 
appeared in Dudley (1966, Proposition 2) and a later form in Dudley (1978). 
The equivalence with a different pseudometric t in Theorem 3.34 is due to 
Giné and Zinn (1986, p. 58). 

Lemma 3.35 is essentially contained in the proof of Skorohod (1976, Theorem 1),
as Erich Berger kindly pointed out. See also Eršov (1975).


Notes to Section 3.8. Alexander (1987, Corollary 2.7) stated Theorem 3.36 
but did not publish a proof of it, although he had written out an unpublished 
proof several years earlier. He says that the result is “an extension of a slightly 
weaker result of Dudley (1981)," where $\mathcal{F}_2$ is finite, but this author himself
doesn’t think his 1981 result was only “slightly weaker”! The proof presented 
was suggested by Miguel Arcones in Berkeley during the fall of 1991, but I 
take responsibility for any possible errors in it. Apparently van der Vaart (1996, 
Theorem A.3) first published a proof. 


Notes to Section 3.9. Theorem 3.37 first appeared in Dudley (1978, Section 
2), Theorem 3.38 in Dudley (1981), and Proposition 3.39 in Dudley (1984). 


Vapnik–Červonenkis Combinatorics


This chapter will treat some classes of sets satisfying a combinatorial condition.
In Chapter 6 it will be shown that under a mild measurability condition to be
treated in Chapter 5, these classes have the Donsker property, for all probability
measures $P$ on the sample space, and satisfy a law of large numbers
(Glivenko–Cantelli property) uniformly in $P$. Moreover, for either of these limit-theorem
properties of a class of sets (without assuming any measurability), the
Vapnik–Červonenkis property is necessary (Section 6.4).

The name Červonenkis is sometimes transliterated into English as Chervonenkis.
The present chapter will be self-contained, not depending on anything
earlier in this book, except in some examples.


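4.1 Vapnik–Červonenkis Classes of Sets


Let $X$ be any set and $\mathcal{C}$ a collection of subsets of $X$. For $A \subset X$ let $\mathcal{C}_A :=
\mathcal{C} \sqcap A := A \sqcap \mathcal{C} := \{C \cap A : C \in \mathcal{C}\}$. Let $\mathrm{card}(A) := |A|$ denote the cardinality
(number of elements) of $A$ and $2^A := \{B : B \subset A\}$. Let $\Delta^{\mathcal{C}}(A) := |\mathcal{C}_A|$. If
$A \sqcap \mathcal{C} = 2^A$, then $\mathcal{C}$ is said to shatter $A$. If $A$ is finite, then $\mathcal{C}$ shatters $A$ if and
only if $\Delta^{\mathcal{C}}(A) = 2^{|A|}$.

Let $m^{\mathcal{C}}(n) := \max\{\Delta^{\mathcal{C}}(F) : F \subset X,\ |F| = n\}$ for $n = 0, 1, \dots$, or if
$|X| < n$ let $m^{\mathcal{C}}(n) := m^{\mathcal{C}}(|X|)$. Then $m^{\mathcal{C}}(n) \le 2^n$ for all $n$. Let

$$V(\mathcal{C}) := \begin{cases} \inf\{n : m^{\mathcal{C}}(n) < 2^n\}, & \text{if this is finite},\\ +\infty, & \text{if } m^{\mathcal{C}}(n) = 2^n \text{ for all } n, \end{cases}$$

$$S(\mathcal{C}) := \begin{cases} \sup\{n : m^{\mathcal{C}}(n) = 2^n\}, & \mathcal{C} \ne \emptyset,\\ -1, & \text{if } \mathcal{C} \text{ is empty}. \end{cases}$$

Then $S(\mathcal{C}) = V(\mathcal{C}) - 1$, and $S(\mathcal{C})$ is the largest cardinality of a set shattered by
$\mathcal{C}$, or $+\infty$ if arbitrarily large finite sets are shattered. So, $V(\mathcal{C})$ is the smallest $n$,
if one exists, such that no set of cardinality $n$ is shattered by $\mathcal{C}$. If $V(\mathcal{C}) < \infty$,
or equivalently if $S(\mathcal{C}) < \infty$, $\mathcal{C}$ will be called a Vapnik–Červonenkis class or
VC class. In the (very large) machine learning literature relating to VC classes,
$S(\mathcal{C})$ is called the VC dimension of $\mathcal{C}$.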
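
For a finite set $X$ these definitions can be evaluated by brute force. The following is a minimal sketch, not an efficient algorithm; the function names `trace_count`, `shatters`, `growth` and `vc_dimension`, and the example class of discrete intervals, are hypothetical choices made here for illustration.

```python
from itertools import combinations

def trace_count(sets, A):
    """Delta^C(A): number of distinct intersections of A with members of the class."""
    A = frozenset(A)
    return len({frozenset(C) & A for C in sets})

def shatters(sets, A):
    """True if every subset of the finite set A is a trace of some member of the class."""
    return trace_count(sets, A) == 2 ** len(A)

def growth(sets, X, n):
    """m^C(n): largest trace count over n-element subsets of X (assumes n <= |X|)."""
    return max(trace_count(sets, F) for F in combinations(X, n))

def vc_dimension(sets, X):
    """S(C) for a class of subsets of the finite set X; -1 for an empty class."""
    if not sets:
        return -1
    best = 0
    for n in range(1, len(X) + 1):
        if growth(sets, X, n) == 2 ** n:
            best = n
    return best

# Example class: all "intervals" {a, ..., b-1} inside X = {0, ..., 5}, including the empty set.
X = range(6)
intervals = [set(range(a, b)) for a in range(7) for b in range(a, 7)]
```

For this class every two-point set is shattered, while no three-point set is, because the two outer points cannot be picked out without the middle one; so `vc_dimension(intervals, X)` is 2.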

If $X$ is finite, with $n$ elements, then clearly $2^X$ is a VC class, with $S(2^X) = n$.

Let ${}_N C_{\le k} := \sum_{j=0}^{k} \binom{N}{j}$, where

$$\binom{N}{j} := \begin{cases} N!/(j!(N-j)!), & j = 0, 1, \dots, N,\\ 0, & j > N. \end{cases}$$

Then ${}_N C_{\le k}$ is the number of combinations of $N$ things, at most $k$ at a time.
(In an older notation ${}_N C_k := \binom{N}{k}$.) "Pascal's triangle" of identities for binomial
coefficients extends to the ${}_N C_{\le k}$:


Proposition 4.1 ${}_N C_{\le k} = {}_{N-1}C_{\le k} + {}_{N-1}C_{\le k-1}$ for $k = 1, 2, \dots$, and $N =
1, 2, \dots$.


Proof. For each $j = 1, 2, \dots, N$, we have by the classical Pascal's triangle

$$\binom{N}{j} = \binom{N-1}{j} + \binom{N-1}{j-1} \quad\text{and}\quad \binom{N}{0} = \binom{N-1}{0} = 1.$$

Summing over $j$, the conclusion follows. If $N \le k$, we get $2^N = 2^{N-1} +
2^{N-1}$.
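
The identity of Proposition 4.1 is easy to confirm numerically for small values. The sketch below uses the hypothetical helper name `c_le` for the sum of binomial coefficients and relies on `math.comb` returning 0 when the lower index exceeds the upper, which matches the convention above.

```python
from math import comb

def c_le(N, k):
    """Sum of the binomial coefficients C(N, j) for j = 0, ..., k."""
    return sum(comb(N, j) for j in range(k + 1))

# Check the Pascal-type identity for a range of N and k, including cases with N <= k.
pascal_ok = all(c_le(N, k) == c_le(N - 1, k) + c_le(N - 1, k - 1)
                for N in range(1, 15) for k in range(1, 20))
```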


For a non-VC class $\mathcal{C}$ we have $m^{\mathcal{C}}(n) = 2^n$ for all $n$. For a VC class, the next
fact, which is fundamental in the Vapnik–Červonenkis theory, will imply that
$m^{\mathcal{C}}(n)$ grows only as a polynomial rather than exponentially in $n$.


Theorem 4.2 (Sauer's Lemma) If $m^{\mathcal{C}}(n) > {}_n C_{\le k-1}$, where $k \ge 1$, then
$m^{\mathcal{C}}(k) = 2^k$. Hence if $S(\mathcal{C}) < \infty$, then $m^{\mathcal{C}}(n) \le {}_n C_{\le S(\mathcal{C})}$ for all $n$.


Proof. The proof is by induction on $k$ and $n$. For $k = 1$, ${}_n C_{\le 0} = 1 < m^{\mathcal{C}}(n)$
implies that $\mathcal{C}$ contains at least two elements, so for some singleton $G =
\{x\}$, $\Delta^{\mathcal{C}}(G) = 2$ as desired. If $k > n$, then ${}_n C_{\le k-1} = 2^n \ge m^{\mathcal{C}}(n)$, so the
assumption implies $k \le n$.

Now assume that the theorem holds whenever $k \le K$ and $n \ge k$, for all
$\mathcal{C}$. Fix $k := K + 1$. For $n < k$, as noted, it holds vacuously. For $n = k$, the
hypothesis $m^{\mathcal{C}}(n) > {}_n C_{\le n-1} = 2^n - 1$ implies $m^{\mathcal{C}}(n) = 2^n$ as desired. Then
to continue the proof by induction on $n$ for $k$ fixed, supposing the statement
holds for $n \le N$, it will be proved for $n = N + 1$. Let $H_n := \{x_1, \dots, x_n\}$ be
a set with $n$ elements. Suppose $\Delta^{\mathcal{C}}(H_n) > {}_n C_{\le K}$. Let $H_N := \{x_1, \dots, x_N\}$. If
$\Delta^{\mathcal{C}}(H_N) > {}_N C_{\le K}$, then by induction assumption, $m^{\mathcal{C}}(k) = 2^k$ as desired. So
assume

$$\Delta^{\mathcal{C}}(H_N) \le {}_N C_{\le K}. \tag{4.1}$$

Let $\mathcal{C}_n := H_n \sqcap \mathcal{C} := \{A \cap H_n : A \in \mathcal{C}\}$. Call a set $E \subset H_N$ full iff both $E$
and $E \cup \{x_n\}$ belong to $\mathcal{C}_n$. Let $f$ be the number of full sets. Then the map
$A \mapsto A \cap H_N$ takes $\mathcal{C} \sqcap H_n$ onto $\mathcal{C} \sqcap H_N$ and is two-to-one onto full sets and
one-to-one onto non-full sets in $\mathcal{C} \sqcap H_N$. Thus

$$\Delta^{\mathcal{C}}(H_n) = \Delta^{\mathcal{C}}(H_N) + f. \tag{4.2}$$

Let $\mathcal{F}$ be the collection of all full sets. Suppose $f = \Delta^{\mathcal{F}}(H_N) > {}_N C_{\le K-1}$.
Then by induction assumption there is a $G \subset H_N$ with $\mathrm{card}(G) = K$ and
$\Delta^{\mathcal{F}}(G) = 2^K$. For $J := G \cup \{x_n\}$ we then have $\mathrm{card}(J) = k$ and $\Delta^{\mathcal{C}}(J) = 2^k$
as desired.

In the remaining case, $f \le {}_N C_{\le K-1}$. Then by (4.1) and (4.2),

$$\Delta^{\mathcal{C}}(H_n) \le {}_N C_{\le K} + {}_N C_{\le K-1} = {}_n C_{\le K}$$

by Proposition 4.1, a contradiction. So the first sentence of the Theorem is
proved. The second follows from the definition of $S(\mathcal{C})$.
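
Sauer's Lemma can be illustrated on small finite examples. The sketch below is an illustration only: it draws some random classes of subsets of a six-point set (the seed, sizes and helper names are arbitrary, hypothetical choices) and checks the bound $m^{\mathcal{C}}(n) \le {}_n C_{\le S(\mathcal{C})}$ by brute force. It also checks that the class of all subsets with at most two elements, a finite variant chosen here for convenience, attains the bound with equality.

```python
import random
from itertools import combinations
from math import comb

def growth(sets, X, n):
    """m^C(n): largest number of distinct traces on an n-element subset of X."""
    return max(len({C & frozenset(F) for C in sets}) for F in combinations(X, n))

def vc_dim(sets, X):
    """S(C) for a nonempty class of subsets of the finite set X."""
    return max(n for n in range(len(X) + 1) if growth(sets, X, n) == 2 ** n)

def c_le(n, k):
    """Sum of the binomial coefficients C(n, j) for j = 0, ..., k."""
    return sum(comb(n, j) for j in range(k + 1))

def sauer_holds(sets, X):
    """Check m^C(n) <= nC<=S(C) for every n up to |X|."""
    s = vc_dim(sets, X)
    return all(growth(sets, X, n) <= c_le(n, s) for n in range(len(X) + 1))

rng = random.Random(0)
X = list(range(6))
all_subsets = [frozenset(F) for n in range(7) for F in combinations(X, n)]
random_classes = [rng.sample(all_subsets, rng.randint(1, 20)) for _ in range(50)]
checks = [sauer_holds(cls, X) for cls in random_classes]

# Class of all subsets with at most 2 elements: the bound holds with equality.
small = [F for F in all_subsets if len(F) <= 2]
```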


For fixed $k$, ${}_n C_{\le k}$ is easily seen to be a polynomial in $n$ of degree $k$, with
leading term $n^k/k!$. Thus, the next fact is not far from optimal:


Proposition 4.3 (Vapnik and Červonenkis) For any nonnegative integers $n$
and $k$ with $n \ge k + 2$, ${}_n C_{\le k} \le 1.5n^k/k!$. For $k \ge 1$ we have $1.5n^k/k! \le
(ne/k)^k$.


Proof. The latter inequality holds by the simplest form of the Stirling formula
with error bounds, Theorem 1.17, $k! \ge (k/e)^k\sqrt{2\pi k}$, and since $1.5 < \sqrt{2\pi}$.

The first inequality clearly holds for $k = 0$, so assume $k \ge 1$. By the binomial
theorem, $n^{k-1}(k + n) \le (n + 1)^k$, so

$$n^{k-1}/(k-1)! + n^k/k! \le (n + 1)^k/k!. \tag{4.3}$$

The proof will be done by induction on $n$ and $k$. For $k = 1$ the inequality is
$n + 1 \le 1.5n$, which holds for all $n \ge 2$, and so for $n \ge k + 2 = 3$.
For $n = k + 2$, the desired inequality is

$$2^n - n - 1 \le 1.5n^{n-2}/(n-2)! = 1.5(n-1)n^{n-1}/n!.$$

This can be checked directly for $n = 3, 4, 5$, and 6. Stirling's formula with an
error bound (Theorem 1.17) gives

$$n! \le \Bigl(\frac{n}{e}\Bigr)^n (2\pi n)^{1/2} e^{1/(12n)},$$

so it will be enough to prove

$$2^n \Bigl(\frac{n}{e}\Bigr)^n (2\pi n)^{1/2} e^{1/(12n)} \le 1.5(n-1)n^{n-1}, \qquad n \ge 7,$$

which follows from $(e/2)^n \ge 2n^{1/2}$, $n \ge 7$; for $f(x) := (e/2)^x$ and $g(x) :=
2x^{1/2}$ it is straightforward to check that $f(7) > g(7)$, $f'(7) > g'(7)$, $f'' > 0$
and $g'' < 0$, so $f(x) > g(x)$ for all $x \ge 7$.

Now suppose the Proposition has been proved for $n = k + i$, $i = 2, \dots, j$,
and for $n = k + J$, $J := j + 1$, for $k = 1, \dots, K$, as we have done for $j = 2$
and for $K = 1$. We need to prove (4.3) for $n = k + J$ and $k = K + 1$. We have
$k + j = K + J$ and

$${}_n C_{\le k} = {}_{k+J}C_{\le k} = {}_{k+j}C_{\le k} + {}_{K+J}C_{\le K} \quad \text{by Proposition 4.1}$$

$$\le 1.5(k + j)^k/k! + 1.5(K + J)^K/K! \quad \text{by induction hypotheses}$$

$$\le 1.5n^k/k! \quad \text{by (4.3)},$$

completing the proof.
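
Both inequalities of Proposition 4.3 can be spot-checked numerically. The sketch below is a finite check over an arbitrary, hypothetical range of $n$ and $k$, not a proof; the first inequality is tested in exact integer arithmetic by clearing denominators, the second in floating point.

```python
from math import comb, e, factorial

def c_le(n, k):
    """Sum of the binomial coefficients C(n, j) for j = 0, ..., k."""
    return sum(comb(n, j) for j in range(k + 1))

# First inequality, nC<=k <= 1.5 n^k / k!, rewritten as 2 * nC<=k * k! <= 3 * n^k.
first_ok = all(2 * c_le(n, k) * factorial(k) <= 3 * n**k
               for k in range(0, 12) for n in range(k + 2, k + 40))

# Second inequality, 1.5 n^k / k! <= (n e / k)^k, for k >= 1.
second_ok = all(1.5 * n**k / factorial(k) <= (n * e / k) ** k
                for k in range(1, 12) for n in range(k + 2, k + 40))
```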


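Combining Theorem 4.2 and Proposition 4.3 gives $m^{\mathcal{C}}(n) \le 1.5n^k/k!$ for
$n \ge k + 2$ where $k := S(\mathcal{C})$. To see that Theorem 4.2 is sharp, let $X$ be an
infinite set and $\mathcal{C}$ the collection of all subsets of $X$ with cardinality $k$. Then
$S(\mathcal{C}) = k$ and the inequality in the second sentence of the theorem becomes an
equality for all $n$.

Let

$$\mathrm{dens}(\mathcal{C}) := \inf\{r > 0 : \text{for some } K < \infty,\ m^{\mathcal{C}}(n) \le Kn^r \text{ for all } n \ge 1\}.$$

Then we have


Corollary 4.4 For any set $X$ and $\mathcal{C} \subset 2^X$, $\mathrm{dens}(\mathcal{C}) \le S(\mathcal{C})$. Conversely if
$\mathrm{dens}(\mathcal{C}) < \infty$ then $S(\mathcal{C}) < \infty$.


Proof. By Theorem 4.2 and Proposition 4.3, there is a $K$ such that $m^{\mathcal{C}}(n) \le
Kn^{S(\mathcal{C})}$ for all $n \ge S(\mathcal{C}) + 2$. The same holds for all $n \ge 1$, possibly with a
larger $K$, so $\mathrm{dens}(\mathcal{C}) \le S(\mathcal{C})$. Conversely if $\mathrm{dens}(\mathcal{C}) < \infty$, then since $m^{\mathcal{C}}(n) \le
Kn^r < 2^n$ for $n$ large we have $S(\mathcal{C}) < \infty$.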


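Note that $S(\mathcal{C})$ can be determined by one large shattered set while $\mathrm{dens}(\mathcal{C})$
has to do with the behavior of $\mathcal{C}$ on arbitrarily large finite sets. For example, if
$X$ is a set with $\mathrm{card}(X) = n$ and $\mathcal{C} = 2^X$, then $S(\mathcal{C}) = n$ while $\mathrm{dens}(\mathcal{C}) = 0$.

For any set $X$, it is immediate that if $\mathcal{C} \subset \mathcal{D} \subset 2^X$, then $S(\mathcal{C}) \le S(\mathcal{D})$ and
$\mathrm{dens}(\mathcal{C}) \le \mathrm{dens}(\mathcal{D})$.

The following is straightforward since for any set $X$, the map $A \mapsto X \setminus A$
is one-to-one from $2^X$ onto itself, and for any $A, B, C \subset X$, $A \cap B \ne C \cap B$
if and only if $(X \setminus A) \cap B \ne (X \setminus C) \cap B$:


Proposition 4.5 If $X$ is any set, $\mathcal{C} \subset 2^X$ and $\mathcal{D} := \{X \setminus A : A \in \mathcal{C}\}$ then for
all $B \subset X$, $\Delta^{\mathcal{C}}(B) = \Delta^{\mathcal{D}}(B)$, so $m^{\mathcal{C}}(n) = m^{\mathcal{D}}(n)$ for all $n$, $S(\mathcal{D}) = S(\mathcal{C})$ and
$\mathrm{dens}(\mathcal{D}) = \mathrm{dens}(\mathcal{C})$.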


4.2 Generating Vapnik–Červonenkis Classes


First, here are some examples of non-VC classes for which some uniform limit 
theorems for empirical measures fail. 

First, let $X = [0, 1]$ and let $\mathcal{C}$ be the class of all finite subsets of $X$. Let $P$
be the uniform (Lebesgue) law on $[0, 1]$. Clearly, $S(\mathcal{C}) = +\infty$, and $\mathcal{C}$ is not a
VC class. Also, for any possible value of $P_n$, we will have $P_n(A) = 1$ for some
$A = \{X_1, \dots, X_n\} \in \mathcal{C}$ while $P(A) = 0$. Thus $\sup_{A \in \mathcal{C}}(P_n - P)(A) = 1$ for all
$n$, so $\mathcal{C}$ is not a Glivenko–Cantelli class for $P$, in other words,

$$\|P_n - P\|_{\mathcal{C}} := \sup_{A \in \mathcal{C}} |(P_n - P)(A)|$$

does not approach 0 as $n \to \infty$ in any sense, e.g. in outer probability, since it
is identically 1. It follows that $\mathcal{C}$ is also not a Donsker class for $P$.

Note that all functions $1_A$ for $A \in \mathcal{C}$ equal 0 almost surely for $P$. Thus, the
whole class $\mathcal{F} := \{1_A : A \in \mathcal{C}\}$ reduces to the one point 0 in the space $L^2(P)$
of equivalence classes for equality almost everywhere of functions in $\mathcal{L}^2(P)$,
that is, measurable, square-integrable functions. Thus for purposes of empirical
processes, functions equal a.s. $P$ are not the same, and we need to deal with
classes $\mathcal{F} \subset \mathcal{L}^2(P)$ of actual real-valued functions, not equivalence classes.
Then, the integral $\int f\,d(P_n - P)$ will be well-defined for any $f \in \mathcal{L}^2(P)$. This
integral is linear in $f$ and thus prelinear for $f \in \mathcal{F}$ for any set $\mathcal{F} \subset \mathcal{L}^2(P)$.
For the empirical process $\nu_n = n^{1/2}(P_n - P)$ we will not be taking versions or
modifications as was done for Gaussian processes (Appendix I).

Next, let $\mathcal{C}$ be the collection of all closed, convex subsets of $\mathbb{R}^2$. Let $S^1$
be the unit circle $\{(x, y) : x^2 + y^2 = 1\}$. For any finite subset $F$ of $S^1$, the
convex polygon with vertices in $F$ (a singleton if $|F| = 1$, or a line segment
if $|F| = 2$) is in $\mathcal{C}$ and its intersection with $S^1$ is $F$. Thus $S(\mathcal{C}) = +\infty$ and $\mathcal{C}$
is not a VC class. Let $P$ be the uniform law $dP(\theta) = d\theta/(2\pi)$ on $S^1$. Then
the Glivenko–Cantelli and Donsker properties fail for $P$ just as in the previous
example.

Classes with $S(\mathcal{C})$ finite, in other words Vapnik–Červonenkis classes, can
be formed in various ways. Here is one. Let $\mathcal{G}$ be a collection of real-valued
functions on a set $X$. Let

$$\mathrm{pos}(g) := \{x : g(x) > 0\}, \qquad \mathrm{nn}(g) := \{x : g(x) \ge 0\}, \qquad g \in \mathcal{G},$$
$$\mathrm{pos}(\mathcal{G}) := \{\mathrm{pos}(g) : g \in \mathcal{G}\}, \qquad \mathrm{nn}(\mathcal{G}) := \{\mathrm{nn}(g) : g \in \mathcal{G}\},$$
$$U(\mathcal{G}) := \mathrm{pos}(\mathcal{G}) \cup \mathrm{nn}(\mathcal{G}).$$


Theorem 4.6 Let $H$ be an $m$-dimensional real vector space of functions on a set
$X$, $f$ any real function on $X$, and $H_1 := \{f + h : h \in H\}$. Then $S(\mathrm{pos}(H_1)) =
S(\mathrm{nn}(H_1)) = m$. If $H$ contains the constants, then also $S(U(H_1)) = m$.


Proof. First it will be shown that S(pos(H_1)) ≤ m. Clearly card(X) ≥ m. If
card(X) = m, then H = H_1 is the set ℝ^X of all real-valued functions on X, so
the result holds.

Otherwise, let A ⊂ X with card(A) = m + 1. Let G be the vector space
{af + h: a ∈ ℝ, h ∈ H}. Let r_A: G → ℝ^A be the restriction of functions in
G to A. If r_A is not onto, take 0 ≠ v ∈ ℝ^A where v is orthogonal to r_A(G) for
the usual inner product (·, ·)_A.

Let A_+ := {x ∈ A: v(x) > 0}. We can assume A_+ is nonempty, replacing v
by −v if necessary. If A_+ = A ∩ pos(g) for some g ∈ G, then (r_A(g), v)_A > 0,
a contradiction. So pos(G) doesn’t shatter A.

Suppose instead that r_A(G) = ℝ^A. Then r_A is 1-1 on G, f ∉ H, and r_A(H_1)
is a hyperplane in ℝ^A not containing 0, so that for some v ∈ ℝ^A, (j, v)_A =
−1 for all j ∈ r_A(H_1). Again let A_+ := {x ∈ A: v(x) > 0}. If A_+ = A ∩
pos(f + h) for some h ∈ H, then (f + h, v)_A ≥ 0, a contradiction (here A_+
may be empty). Thus pos(H_1) never shatters A, so S(pos(H_1)) ≤ m.

For each x ∈ X, a linear form δ_x is defined on H by δ_x(h) := h(x), h ∈ H.
Let H′ be the vector space of all real linear forms on H. Then H′ is
m-dimensional. Let H_X be the linear span in H′ of the set of all δ_x,
x ∈ X,

\[
H_X := \Bigl\{ \sum_{j=1}^{r} a_j \delta_{x_j} :\ x_j \in X,\ a_j \in \mathbb{R},\ r = 1, 2, \ldots \Bigr\}. \tag{4.4}
\]

The map h ↦ (ψ ↦ ψ(h)), h ∈ H, ψ ∈ H_X, is 1-1 and linear from H
into (H_X)′, so H_X is m-dimensional. Take B = {x_1, ..., x_m} ⊂ X such that
the δ_{x_j} are linearly independent in H′. So r_B(H) = ℝ^B, r_B(H_1) = ℝ^B, and
pos(r_B(H_1)) = 2^B, so S(pos(H_1)) = m.

Then S(nn(H_1)) = m by taking complements (Proposition 4.5). If H contains
the constant functions, then the sets nn(f), f ∈ H_1, are the same as the
sets {f ≥ t}, f ∈ H_1, t ∈ ℝ, and the sets pos(f), f ∈ H_1, are the same as
the sets {f > t}, f ∈ H_1, t ∈ ℝ. Now for any finite subset A of X, f ∈ H_1
and t ∈ ℝ, since f takes only finitely many values on A, there exist s and u
such that A ∩ {f > t} = A ∩ {f ≥ s} and A ∩ {f ≥ t} = A ∩ {f > u}. So in
this case S(U(H_1)) = m.


Examples. (I) Let H := P_{d,k} be the space of all polynomials of degree at most
k on ℝ^d. Then for each d and k, H is a finite-dimensional vector space of
functions, so pos(H) is a Vapnik–Červonenkis class. For k = 2, it follows
specifically that the set of all ellipsoids in ℝ^d is included in a
Vapnik–Červonenkis class and thus is one.


(II) Let X = ℝ. Let H be the 1-dimensional space of linear functions f(x) =
cx, x ∈ ℝ, c ∈ ℝ. Then S(pos(H)) = S(nn(H)) = 1 by Theorem 4.6, but
U(H) shatters {0, 1}. Since sets in U(H) are convex (half-lines), it follows that
S(U(H)) = 2. So the condition that H contains the constants cannot just be
dropped from Theorem 4.6 for U(H).
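
As an illustrative check of Example (II) (a sketch added here, not part of the
original text), the traces of pos(H) and U(H) on {0, 1} can be enumerated in
Python. Since pos(cx) and nn(cx) depend only on the sign of c, the three slopes
c = −1, 0, 1 are assumed to represent all of H; the helper names `pos` and `nn`
simply mirror the notation above.

```python
# Traces on {0, 1} of pos(H) and U(H) for H = {x -> c*x}.
# Only the sign of c matters, so c in {-1, 0, 1} covers every case.
POINTS = (0, 1)
SLOPES = (-1, 0, 1)

def pos(c, pts):
    """Trace of {x : c*x > 0} on pts."""
    return frozenset(x for x in pts if c * x > 0)

def nn(c, pts):
    """Trace of {x : c*x >= 0} on pts."""
    return frozenset(x for x in pts if c * x >= 0)

pos_traces = {pos(c, POINTS) for c in SLOPES}
nn_traces = {nn(c, POINTS) for c in SLOPES}
u_traces = pos_traces | nn_traces

# A two-point set is shattered exactly when all 4 of its subsets occur as traces.
print(len(pos_traces), len(nn_traces), len(u_traces))
```

Only the union U(H) yields all four subsets of {0, 1}, which is the shattering
claimed in the example; neither pos(H) nor nn(H) does so alone.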


Let X be a real vector space of dimension m. Let H be the space of all
real affine functions on X, in other words, functions of the form h + c where
h is real linear and c is any real constant. Then H has dimension m + 1, and
pos(H) is the set of all open half-spaces of X. Letting f = 0 in Theorem 4.6
for this H gives a special case known as Radon’s Theorem. On the other hand,
Theorem 4.6 for f = 0 with general X and H follows from Radon’s Theorem
via the following stability fact:


Theorem 4.7 If X and Y are sets, F is a function from X into Y, C ⊂ 2^Y, and
F^{-1}(C) := {F^{-1}(A): A ∈ C}, then S(F^{-1}(C)) ≤ S(C). If F is onto Y, then
S(F^{-1}(C)) = S(C).


Proof. Let F^{-1}(C) shatter {x_1, ..., x_m} where x_i ≠ x_j for i ≠ j. Then F(x_i) ≠
F(x_j) for i ≠ j and C shatters {F(x_1), ..., F(x_m)}. So S(F^{-1}(C)) ≤ S(C). If
F is onto Y and H ⊂ Y with card(H) = m, choose G ⊂ X such that F takes
G 1-1 onto H. Then if C shatters H, F^{-1}(C) shatters G, so S(F^{-1}(C)) =
S(C).
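
The statement of Theorem 4.7 can be tried out on small finite sets. The sketch
below is an added illustration, not the author's; the sets X and Y, the class C,
and the two maps `onto` and `not_onto` are hypothetical choices, and `vc_index`
is a brute-force helper computing S(·) for a finite class.

```python
from itertools import combinations

def vc_index(cls, X):
    """Brute-force S(cls) over the finite ground set X; -1 for the empty class."""
    cls = {frozenset(c) for c in cls}
    if not cls:
        return -1
    s = 0
    for k in range(1, len(X) + 1):
        shattered = any(
            len({c & frozenset(F) for c in cls}) == 2 ** k
            for F in combinations(X, k)
        )
        if not shattered:
            break          # no k-set shattered, so no larger set is either
        s = k
    return s

def preimage_class(F, cls):
    """F^{-1}(cls) for a map F given as a dict from X into Y."""
    return [{x for x, y in F.items() if y in A} for A in cls]

Y = ("a", "b", "c")
C = [set(), {"a"}, {"b"}, {"a", "b"}, {"c"}]

onto = {0: "a", 1: "a", 2: "b", 3: "c", 4: "c"}   # maps X = {0,...,4} onto Y
not_onto = {0: "a", 1: "a", 2: "a"}               # misses "b" and "c"

s_C = vc_index(C, Y)
s_onto = vc_index(preimage_class(onto, C), tuple(onto))
s_not_onto = vc_index(preimage_class(not_onto, C), tuple(not_onto))
print(s_C, s_onto, s_not_onto)
```

For the onto map the index is preserved, while for the map that is not onto it
can only drop, as the theorem asserts.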


Now let X be any set and G a finite-dimensional real vector space of real
functions on X. Then there is a natural map F : x ↦ δ_x from X into the space
of linear functions on G. Then by Theorem 4.7 one could deduce Theorem
4.6 from its special case where X is an m- or (m + 1)-dimensional real vector
space and f and all functions in H are affine, so that sets in pos(H) are open
half-spaces.

Next it will be seen how a bounded number of Boolean operations preserves
the Vapnik–Červonenkis property.


Theorem 4.8 Let X be a set, C ⊂ 2^X, and for k = 1, 2, ..., let C^{(k)} be the
union of all (Boolean) algebras generated by k or fewer elements of C. Then
dens(C^{(k)}) ≤ k·dens(C), so if S(C) < ∞, then S(C^{(k)}) < ∞.


Proof. Let dens(C) = r, so that for any ε > 0 there is some M < ∞ such that
m^C(n) ≤ M n^{r+ε} for all n.

For any A ⊂ X we have A ⊓ C^{(k)} = (A ⊓ C)^{(k)}. An algebra 𝒜 with k
generators A_1, ..., A_k has at most 2^k atoms, which are those nonempty sets
that are intersections of some of the A_i and the complements of the rest. Sets
in 𝒜 are unions of atoms, so |𝒜| ≤ 2^{2^k}. Thus |A ⊓ C^{(k)}| ≤ 2^{2^k} |A ⊓ C|^k ≤
2^{2^k} M^k |A|^{k(r+ε)}. Letting ε ↓ 0 gives dens(C^{(k)}) ≤ k·dens(C). If S(C) < ∞, then
by Corollary 4.4, S(C^{(k)}) < ∞.


The constant 2^{2^k} is very large if k is at all large. Let C^{(∩k)} be the class of all
intersections of at most k sets in C. Then C^{(∩k)} ⊂ C^{(k)}. For C^{(∩k)}, bounds in the
preceding proof can be replaced by |A ⊓ C^{(∩k)}| ≤ |A ⊓ C|^k ≤ M^k |A|^{k(r+ε)},
so that the constant 2^{2^k} is not needed.

Theorems 4.6 and 4.8 can be combined to generate Vapnik–Červonenkis
classes. For example, half-spaces in ℝ^d form a VC class. Intersections of at
most k half-spaces give convex polytopes with at most k faces, so these form a
VC class.


Remarks. Let X be an infinite set, r = 1, 2, ..., and C_r the collection of all
subsets of X with at most r elements. Then clearly dens(C_r) = S(C_r) = r. It is
easy to check that D := C_r^{(k)} consists of all sets B such that either B or X \ B
has at most kr elements. Thus m^D(n) ≤ 2(_nC_{≤kr}), with m^D(n) = 2(_nC_{≤kr}) for
n ≥ 2kr + 1. So dens(D) = kr since _nC_{≤j} is a polynomial in n of degree j.
Thus the inequality dens(C^{(k)}) ≤ k·dens(C) is sharp. But it does not always
hold for S(·) in place of dens(·): if C is the collection of open half-spaces in
ℝ^d, d ≥ 1, then S(C) = d + 1 by Radon’s theorem. For example, using the
d half-spaces {x_j > 0} for j = 1, ..., d as generators, we see that C^{(d)} shatters a
set of 2^d points, one in each coordinate orthant, so S(C^{(d)}) ≥ 2^d > d(d + 1) for
d ≥ 5.
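
The orthant argument can be checked mechanically in a small dimension. The
following Python sketch is an added illustration under stated assumptions: it
takes d = 3 (named `DIM`, chosen only to keep the computation tiny), uses one
point (±1, ..., ±1) per orthant, and closes the traces of the d coordinate
half-spaces under complement and union. The function name `generated_algebra`
is hypothetical.

```python
from itertools import product

DIM = 3                                           # small d, for illustration only
POINTS = list(product((-1, 1), repeat=DIM))       # one point in each orthant

# Traces of the coordinate half-spaces {x_j > 0} on the orthant points.
GENERATORS = [frozenset(p for p in POINTS if p[j] > 0) for j in range(DIM)]

def generated_algebra(generators, universe):
    """Smallest class containing the generators and closed under
    complement and pairwise union (hence a Boolean algebra)."""
    universe = frozenset(universe)
    alg = set(generators)
    while True:
        new = {universe - a for a in alg} | {a | b for a in alg for b in alg}
        if new <= alg:
            return alg
        alg |= new

algebra = generated_algebra(GENERATORS, POINTS)
print(len(algebra), 2 ** len(POINTS))
```

Every subset of the 2^d orthant points appears in the generated algebra, so the
point set is shattered by an algebra with only d generators. The comparison
2^d > d(d + 1) from the text needs d ≥ 5, which this small run does not reach.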


Classes with V(C) = 0 or 1 are easily characterized: 


Proposition 4.9 A class C of subsets of a set X has V(C) = 0, or equivalently
S(C) = −1, if and only if C is empty. Also, V(C) = 1, or equivalently S(C) = 0,
if and only if C contains exactly one set. Thus S(C) ≥ 1 if and only if C contains
at least two sets.


Proof. Clearly C shatters the empty set if and only if C contains at least one
set. If C contains at least two sets, then for some A, B ∈ C and x ∈ X, x ∈
A \ B. Then C shatters {x}, so S(C) ≥ 1. Conversely if S(C) ≥ 1, then clearly
C contains at least two sets.


A collection C of sets is said to be linearly ordered by inclusion if for 
any A, B ∈ C, either A ⊂ B or B ⊂ A. Here are two sufficient conditions for
S(C) = 1: 


Theorem 4.10 If C is a collection of at least two subsets of a set X, then
S(C) = 1 if either
(a) C is linearly ordered by inclusion, or
(b) Any two sets in C are disjoint.


Proof. In any case S(C) ≥ 1. If C is linearly ordered by inclusion, suppose it
shatters {x, y} for some x ≠ y. Let A, B ∈ C, A ∩ {x, y} = {x}, B ∩ {x, y} =
{y}. But A ⊂ B or B ⊂ A, giving a contradiction.

If the sets in C are disjoint, then we can argue as in part (a), and now
take D ∈ C with {x, y} ⊂ D, but D cannot be disjoint from A or B, a
contradiction.
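
Both sufficient conditions are easy to test on finite examples. The sketch below
is an added illustration; the two classes `chain` and `disjoint` on the ground
set {0, ..., 5} are hypothetical, and `vc_index` is a brute-force helper for S(·).

```python
from itertools import combinations

def vc_index(cls, X):
    """Brute-force S(cls) over the finite ground set X; -1 for the empty class."""
    cls = {frozenset(c) for c in cls}
    if not cls:
        return -1
    s = 0
    for k in range(1, len(X) + 1):
        shattered = any(
            len({c & frozenset(F) for c in cls}) == 2 ** k
            for F in combinations(X, k)
        )
        if not shattered:
            break
        s = k
    return s

X = tuple(range(6))
chain = [set(range(i)) for i in range(7)]     # Ø, {0}, {0,1}, ..., linearly ordered
disjoint = [{0, 1}, {2}, {3, 4, 5}]           # pairwise disjoint sets

s_chain = vc_index(chain, X)
s_disjoint = vc_index(disjoint, X)
print(s_chain, s_disjoint)
```

Each class contains at least two sets, so some singleton is shattered, and the
brute-force search finds no shattered pair in either case.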


Example. Let X = ℝ and let C be the collection of half-lines (−∞, x] for all
x ∈ ℝ. Then C is linearly ordered by inclusion, so S(C) = 1. Applying
empirical processes √n(P_n − P) to this class of sets gives the classical empirical
processes √n(F_n − F) of Chapter 1.


Section 4.4 will go more into detail about classes of index 1. 


4.3 *Maximal Classes 


Starred sections are referred to later, if at all, only in other starred sections. 

Let C ⊂ A be classes of subsets of a set X. Then C will be called
(A, n)-maximal if S(C) = n and if C ⊂ D strictly and D ⊂ A, then S(D) > n. If A
is the class 2^X of all subsets of X, then C will be called n-maximal. If C is
n-maximal, then clearly C is (A, n)-maximal for any A such that C ⊂ A ⊂ 2^X.

In view of Proposition 4.9, classes C with S(C) = i, i = −1 or 0, are empty
or contain just one set respectively, and so are always i-maximal. Thus
n-maximality is interesting only for n ≥ 1.


Examples. 1. For any set X, let C consist of Ø (the empty set) and all singletons
{x} for x ∈ X. Then C is clearly 1-maximal.


2. Let X = ℝ. Let LH consist of Ø, ℝ, and all left half-lines, closed (−∞, x]
or open (−∞, x), for x ∈ ℝ. In other words, LH is the collection of all subsets
A ⊂ ℝ such that whenever x < y ∈ A then also x ∈ A. Then clearly S(LH) =
1 since for x < y and A ∈ LH, A ∩ {x, y} ≠ {y}. But if any subset of ℝ not
in LH is adjoined, then some 2-element set is shattered, so LH is 1-maximal.


3. Let X = ℝ and let Co consist of all subintervals of ℝ, namely Ø, ℝ, any
closed or open, left or right half-line, and any bounded interval, open or closed
at either end. In other words, Co is the class of all convex subsets of ℝ. Then
S(Co) = 2; in fact, Co shatters every 2-element subset of ℝ, while if x < y < z
and A ∈ Co, then A ∩ {x, y, z} ≠ {x, z}. On the other hand if any set not in
Co is adjoined to it, its index becomes 3, so Co is 2-maximal.
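
Finite analogues of these three examples can be checked by brute force. The
sketch below is an added illustration, not from the original text: it replaces
ℝ by the hypothetical four-point set {0, 1, 2, 3}, so that left half-lines become
initial segments and intervals become runs of consecutive integers. The helpers
`vc_index`, `powerset`, and `is_maximal` are assumed names.

```python
from itertools import combinations

def vc_index(cls, X):
    """Brute-force S(cls) over the finite ground set X; -1 for the empty class."""
    cls = {frozenset(c) for c in cls}
    if not cls:
        return -1
    s = 0
    for k in range(1, len(X) + 1):
        shattered = any(
            len({c & frozenset(F) for c in cls}) == 2 ** k
            for F in combinations(X, k)
        )
        if not shattered:
            break
        s = k
    return s

def powerset(X):
    return [frozenset(s) for k in range(len(X) + 1) for s in combinations(X, k)]

def is_maximal(cls, X):
    """True if adjoining any subset of X not in cls strictly raises S."""
    cls = {frozenset(c) for c in cls}
    n = vc_index(cls, X)
    return all(vc_index(cls | {A}, X) > n for A in powerset(X) if A not in cls)

X = (0, 1, 2, 3)
singletons = [set()] + [{x} for x in X]                  # Example 1
initial_segments = [set(range(i)) for i in range(5)]     # finite analogue of LH
intervals = [set(range(a, b)) for a in range(5) for b in range(a, 5)]  # analogue of Co

for name, cls in [("singletons", singletons),
                  ("initial segments", initial_segments),
                  ("intervals", intervals)]:
    print(name, vc_index(cls, X), is_maximal(cls, X))
```

The check is exhaustive only for this small ground set; it illustrates the
definitions rather than proving the statements about ℝ.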


Here is an existence theorem for maximal classes: 


Theorem 4.11 Let X be a set and D ⊂ 2^X. Suppose that C ⊂ D and S(C) = n.
Then there exists a (D, n)-maximal class B with C ⊂ B.


Proof. Zorn’s Lemma (RAP, Section 1.5) will be applied. Let (B_α)_{α∈I} be such
that C ⊂ B_α ⊂ D for all α in the index set I and such that the B_α are linearly
ordered (form a chain) by inclusion, with S(B_α) = n for all α. Let A := ∪_α B_α.
To show that S(A) = n, suppose A shatters a set F with |F| = n + 1. Each
of the 2^{n+1} subsets of F is induced by a set in some B_α. Since there are only
finitely many of these sets and the B_α are linearly ordered by inclusion, there
is some α such that B_α shatters F, a contradiction. So the chain {B_α}_{α∈I} has
an upper bound. So by Zorn’s Lemma the collection of all B with C ⊂ B ⊂ D
and S(B) = n has a maximal element (for inclusion).


The following fact is straightforward: 


Proposition 4.12 For any set X, Y ⊂ X, C ⊂ 2^X, and C_Y := C ⊓ Y, we have
S(C_Y) ≤ S(C).


Recall that Z_2 := {0, 1} with addition mod 2, in other words the usual addition
except that 1 + 1 = 0. For any set X, the group Z_2^X of all functions from X into
Z_2, with the natural addition (f + g)(x) := f(x) + g(x) in Z_2, provides a group
structure for the collection 2^X of all subsets of X. Addition of indicator functions
mod 2 corresponds to the symmetric difference AΔB := (A \ B) ∪ (B \ A),
so that 1_A + 1_B = 1_{AΔB} mod 2. For any fixed set A ⊂ X, the translation 1_B ↦
1_A + 1_B takes Z_2^X one-to-one and onto itself. If the functions are restricted
to a subset Y ⊂ X, translation still takes Z_2^Y one-to-one and onto itself. For
any A ⊂ X and C ⊂ 2^X, let AΔΔC := {AΔC : C ∈ C}. Then for any finite
F ⊂ X, C shatters F if and only if AΔΔC does. It follows that:


Proposition 4.13 For any fixed set A ⊂ X and class C ⊂ 2^X, we have S(C) =
S(AΔΔC), and C is n-maximal if and only if AΔΔC is.
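
The invariance of shattering under symmetric-difference translation can be
observed directly. The following sketch is an added illustration; the ground
set, the class `cls`, and the translating set `A` are arbitrary hypothetical
choices, and `shatters` and `translate` are assumed helper names.

```python
from itertools import combinations

def shatters(cls, F):
    """True if every subset of F is a trace C ∩ F of some C in cls."""
    F = frozenset(F)
    return len({frozenset(c) & F for c in cls}) == 2 ** len(F)

def translate(A, cls):
    """The translated class {A Δ C : C in cls}."""
    A = frozenset(A)
    return [A ^ frozenset(c) for c in cls]

X = (0, 1, 2, 3)
cls = [{0}, {0, 1}, {1, 2}, {3}, {0, 2, 3}]    # arbitrary example class
A = {1, 3}                                     # arbitrary translating set

subsets = [F for k in range(len(X) + 1) for F in combinations(X, k)]
same_shattered_sets = all(
    shatters(cls, F) == shatters(translate(A, cls), F) for F in subsets
)
print(same_shattered_sets)
```

Because Python's `^` on frozensets is the symmetric difference, `translate`
follows the definition of AΔΔC literally; the comparison runs over every subset
F of the ground set.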


Next, we have: 


Proposition 4.14 If C is an n-maximal class of subsets of a set X, and n ≥ 1,
then ∪_{A∈C} A = X and ∩_{A∈C} A = Ø.


Proof. Suppose there is an x such that x ∉ A for all A ∈ C. Take any B ∈ C
and let D := C ∪ {B ∪ {x}}. Then by n-maximality, D must shatter some set F
of cardinality n + 1 ≥ 2, and evidently x ∈ F. Thus D contains at least 2^n ≥ 2
sets which contain x, but it contains only one, a contradiction. This proves the
first conclusion. The second follows on taking complements, setting A = X in
Proposition 4.13.


On Z_2^X = 2^X there is a product topology coming from the discrete topology
on Z_2. The product topology is compact by Tychonoff’s theorem (RAP,
Theorem 2.2.8).


Proposition 4.15 For any set X, any n-maximal class C ⊂ 2^X is closed and so
compact in 2^X.


Proof. Suppose B_α → B is a convergent net in 2^X with B_α ∈ C for all α.
Then for any finite set F ⊂ X, there is some α with B_α ∩ F = B ∩ F, so
S(C ∪ {B}) = S(C) and B ∈ C. So C is closed (RAP, Theorem 2.1.3).


A class C of subsets of a set X will be called complemented if X \ A ∈ C
for every A ∈ C.


Theorem 4.16 If S(C) = n, C ⊂ A strictly, and C is complemented, then C is
not (A, n)-maximal.


Proof. For any finite set F ⊂ X and G ⊂ F, G ∈ C ⊓ F if and only if F \ G ∈
C ⊓ F. So if |F| = n + 1, then |C ⊓ F| ≤ 2^{n+1} − 2. So, for any A ∈ A \ C,
S(C ∪ {A}) = n.


If F is a k-dimensional real vector space of real-valued functions on a set X
containing the constants and C is the collection U(F) of all sets {x : f(x) > t}
or {x : f(x) ≥ t} for all f ∈ F and real t, then S(C) = k by Theorem 4.6.
Since C is complemented, it is never k-maximal.

Let X be any set and C = C_k the collection of all subsets of X with at most k
elements. Then clearly S(C) = k. Also, C is k-maximal since if A ∉ C, A ⊂ X,
then |A| > k, and if B is any subset of A with |B| = k + 1, then B is shattered
by C ∪ {A}. For C = C_k we have m^C(n) = _nC_{≤k}, which is the maximum
possible value of m^C(n) by Sauer’s Lemma (Theorem 4.2). The following
example shows that not all k-maximal classes have these values of m^C(n):


Example. Let X = {1, 2, 3, 4},
G = {{4}, {1, 3}, {2, 3}, {3, 4}, {1, 2, 3}, {1, 2, 3, 4}}.


Let C be the complement of G in 2^X. Then it can be checked that C is 2-maximal
but |C| = 10 < _4C_{≤2} = 11.
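
The phrase "it can be checked" invites a direct computation. The sketch below
is an added illustration: it builds C as the complement of G in 2^X and tests
2-maximality by adjoining each set of G in turn. The helper names `vc_index`
and `powerset` are assumptions of this sketch.

```python
from itertools import combinations
from math import comb

def vc_index(cls, X):
    """Brute-force S(cls) over the finite ground set X; -1 for the empty class."""
    cls = {frozenset(c) for c in cls}
    if not cls:
        return -1
    s = 0
    for k in range(1, len(X) + 1):
        shattered = any(
            len({c & frozenset(F) for c in cls}) == 2 ** k
            for F in combinations(X, k)
        )
        if not shattered:
            break
        s = k
    return s

def powerset(X):
    return [frozenset(s) for k in range(len(X) + 1) for s in combinations(X, k)]

X = (1, 2, 3, 4)
G = [frozenset(g) for g in ({4}, {1, 3}, {2, 3}, {3, 4}, {1, 2, 3}, {1, 2, 3, 4})]
C = [A for A in powerset(X) if A not in G]

index_of_C = vc_index(C, X)
# 2-maximal: adjoining any set outside C (that is, any set of G) raises the index.
two_maximal = index_of_C == 2 and all(vc_index(C + [g], X) > 2 for g in G)
sauer_bound = sum(comb(len(X), j) for j in range(3))     # 4C<=2
print(len(C), index_of_C, two_maximal, sauer_bound)
```

Here `sauer_bound` is the quantity written _4C_{≤2} in the text, so the final
line compares |C| with the maximum allowed by Sauer's Lemma.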


4.4 *Classes of Index 1 


In this section the structure of classes C with S(C) = 1 will be treated. Recall 
that for classes of two or more sets, disjoint classes and classes linearly ordered 
by inclusion have S(C) = 1 (Theorem 4.10). A common extension of these two 
kinds of classes is given by treelike partial orderings, defined as follows. 

A binary relation ≤ on a set X will be called a quasi-order if it is transitive:
x ≤ y and y ≤ z imply x ≤ z, and reflexive: x ≤ x for all x ∈ X. The
quasi-order is called a partial order if also x ≤ y and y ≤ x imply x = y. For any
set S, inclusion (⊂) is a partial order on 2^S or any subset of 2^S.

Let ≤ be a quasi-order on a set X. Then two elements x and y of X are
called comparable if at least one of x ≤ y and y ≤ x holds, or incomparable
if neither holds. A quasi-order ≤ on X will be called fully comparable if any
two elements of X are comparable. A quasi-order ≤ will be called sub-fully
comparable if for any y ∈ X and L_y := {x : x ≤ y}, the restriction of ≤ to
L_y is fully comparable. A fully comparable partial order is called linear. A
sub-fully comparable partial order will be called treelike.


Theorem 4.17 Let C ⊂ 2^X contain at least two sets and satisfy, for any x ≠ y
in X,


A ∩ {x, y} = Ø for some A ∈ C. (4.5)

In particular it suffices that Ø ∈ C. Then the following are equivalent:


(a) S(C) = 1;
(b) For every Y ⊂ X, the inclusion partial ordering of C_Y := Y ⊓ C is treelike;
(c) For every Y ⊂ X with |Y| = 2, the partial ordering of C_Y := Y ⊓ C by
inclusion is treelike.


Proof. In proving (a) implies (b), since S(C_Y) ≤ S(C) and classes with S(C) < 1
contain at most one set and trivially have a treelike ordering by inclusion, we
can assume Y = X. If C does not have a treelike ordering by inclusion, there
are sets B, C, D in the class C with B ⊂ D, C ⊂ D such that B and C are not
comparable, so there exist some x ∈ B \ C and y ∈ C \ B. Take A from
assumption (4.5). But then the sets A, B, C and D shatter {x, y} and S(C) ≥ 2,
a contradiction. So (b) holds.

Now (b) implies (c) directly. If (c) holds and |Y| = 2, since 2^Y does not have
a treelike ordering by inclusion, C must not shatter Y, so (a) follows.


Proposition 4.18 Let X be a set and A ⊂ 2^X where Ø ∈ A and for any B and
C in A, B ∩ C ∈ A. If C is (A, 1)-maximal and satisfies (4.5) for any x ≠ y in
X, then Ø ∈ C and B ∩ C ∈ C for any B and C in C.


Proof. If x ≠ y and (4.5) holds for A, then A ∩ {x, y} = Ø = Ø ∩ {x, y}, so
adjoining Ø to C does not induce any additional subsets of sets with two
elements, and S(C ∪ {Ø}) = 1 and by maximality Ø ∈ C.

Suppose B, C ∈ C and S(C ∪ {B ∩ C}) > 1. Then for some x ≠ y in X, B ∩
C ∩ {x, y} ≠ D ∩ {x, y} for all D ∈ C. Then by (4.5), we can assume x ∈ B ∩
C. If {x, y} ⊂ B ∩ C ⊂ B, then taking D = B would give a contradiction, so
y ∉ B ∩ C. Now B ∩ C ∩ {x, y} = {x} ≠ D ∩ {x, y} for D = B or C implies
y ∈ B and y ∈ C, again a contradiction. So B ∩ C ∈ C.


Proposition 4.19 Let X be a set and C a finite class of subsets of X with
S(C) = 1 such that for any x ≠ y in X, (4.5) holds. Let D := D(C) consist
of Ø and all intersections of nonempty subclasses of C. Then S(D) = 1. For
each nonempty set D ∈ D there is a C := C(D) ∈ D such that C ⊂ D strictly
(C ≠ D) and if B is any set in D with B ⊂ D strictly, then B ⊂ C.


Proof. By Theorem 4.11, let C ⊂ E with E 1-maximal. Then by Proposition
4.18 for A = 2^X and induction, C ⊂ D ⊂ E, so S(D) = 1. Clearly, D is finite.
For each nonempty D ∈ D, by Theorem 4.17, {B ∈ D: B ⊂ D} is linearly
ordered by inclusion and contains Ø, so it has a largest element C(D) other than
D itself.


Proposition 4.20 Under the hypotheses of Proposition 4.19, the sets D \ C(D)
for distinct nonempty D ∈ D are all disjoint and are nonempty.


Proof. Let A ≠ B in D. If B ⊂ A, then B ⊂ C(A), so B and a fortiori B \ C(B)
are disjoint from A \ C(A). Otherwise, A ∩ B ⊂ B strictly, and then A ∩ B ⊂
C(B), so again A \ C(A) is disjoint from B \ C(B). That A \ C(A) ≠ Ø for
A ≠ Ø follows from the definitions.


A graph is a nonempty set S together with a set E of unordered pairs {x, y}
for some x ≠ y in S. Then S will be called the set of nodes and E the set of
edges of the graph. The graph (S, E) is called a tree if


(a) It is connected, in other words, for any x and y in S there is a finite n
and x_i ∈ S, i = 0, 1, ..., n, such that x_0 = x, x_n = y, and {x_{k−1}, x_k} ∈ E for
k = 1, ..., n.

(b) The graph is acyclic, which means that there is no cycle, where a cycle is a set
of distinct x_1, ..., x_n ∈ S such that n ≥ 3, and letting x_0 := x_n, {x_{k−1}, x_k} ∈ E
for k = 1, ..., n.


Theorem 4.21 (a) For m nodes, for any positive integer m, there exist
connected graphs with m − 1 edges.


(b) A connected graph with m nodes cannot have fewer than m − 1 edges.


(c) A connected graph with m nodes has exactly m − 1 edges if and only if it is
a tree.


Proof. (a) is clear. (b) will be proved by induction. It is clearly true for m = 1, 2.
Suppose (S, E) is a connected graph with |S| = m, |E| ≤ m − 2, and m ≥ 3.
The edges in E contain at most 2m − 4 nodes, counted with multiplicity, so
at least 4 nodes appear in only one edge each, or some node is in no edge, a
contradiction. Select a node in only one edge and delete it and the edge that
contains it. The remaining graph must be connected, but is not by induction
assumption, a contradiction, so (b) holds.

For (c), let (S, E) be a connected graph with |S| = m and |E| = m − 1. If
the graph contains a cycle, we can delete any one edge in the cycle and the
graph remains connected, contradicting (b). So (S, E) is a tree.

Conversely, let (S, E) be a tree with |S| = m. It will be proved by induction
that |E| ≤ m − 1. This is clearly true for m = 1, 2. Suppose |E| ≥ m ≥ 3.
Take a maximal set C := {x_1, ..., x_k} ⊂ S such that the x_j are distinct and
{x_{j−1}, x_j} ∈ E for j = 2, ..., k. Then there is no y ≠ x_2 with {x_1, y} ∈ E:
y cannot be any x_j, j ≥ 3, or there would be a cycle, and there is no such
y ∉ C since C and k are maximal. So we can delete the node x_1 and the
edge {x_1, x_2} from the graph, leaving a graph which is still a tree with m − 1
nodes and at least m − 1 edges, contradicting the induction hypothesis and so
proving (c).
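
Parts (b) and (c) of Theorem 4.21 can be confirmed exhaustively for small m.
The sketch below is an added illustration, not the author's argument: it
enumerates every graph on the nodes 0, ..., m − 1 and uses a union-find
structure (an implementation choice of this sketch) to count components and
detect cycles. The function names are hypothetical.

```python
from itertools import combinations

def components_and_cycle(nodes, edges):
    """Return (number of connected components, whether some cycle exists)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    has_cycle = False
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            has_cycle = True                # edge joins two already-linked nodes
        else:
            parent[ra] = rb
    return len({find(v) for v in nodes}), has_cycle

def check_theorem_4_21(m):
    """Check parts (b) and (c) for every graph on nodes 0, ..., m-1."""
    nodes = range(m)
    all_edges = list(combinations(nodes, 2))
    for r in range(len(all_edges) + 1):
        for edges in combinations(all_edges, r):
            n_comp, has_cycle = components_and_cycle(nodes, edges)
            if n_comp != 1:
                continue                                  # only connected graphs
            if len(edges) < m - 1:                        # would contradict (b)
                return False
            if (len(edges) == m - 1) != (not has_cycle):  # would contradict (c)
                return False
    return True

ok = all(check_theorem_4_21(m) for m in range(1, 6))
print(ok)
```

Since edges are distinct unordered pairs of distinct nodes, an edge whose
endpoints are already linked closes a cycle with at least three nodes, matching
the definition of a cycle given above.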


Let the class D in Propositions 4.19 and 4.20 form the nodes of a graph G
whose edges are the pairs {C(D), D} for D ∈ D, D ≠ Ø.


Proposition 4.22 The graph G is a tree.


Proof. If D has m elements, then there are exactly m − 1 pairs {C(D), D}, for
D ∈ D, D ≠ Ø. Starting with any D ∈ D, we have a decreasing sequence of
sets D ⊃ C(D) ⊃ C(C(D)) ⊃ ··· which must end with the empty set, so all
sets in D are connected in G via the empty set and G is connected. Then by
Theorem 4.21 it is a tree.


Proposition 4.23 Let X be a finite set. Let C be 1-maximal in X and suppose
(4.5) holds for all x ≠ y in X. Then C = D(C) as defined in Proposition 4.19.
The sets D \ C(D) for nonempty D ∈ C are all the singletons {x}, x ∈ X. If
|X| = m then |C| = m + 1.


Proof. C = D(C) by Proposition 4.19. Suppose that for some D ∈ C, D \ C(D)
has two or more elements. Then for some B, C(D) ⊂ B ⊂ D where both
inclusions are strict. It will be shown that S(C ∪ {B}) = 1. If not, then for
some x ≠ y, C ∪ {B} shatters {x, y}, so B ∩ {x, y} ≠ F ∩ {x, y} for all F ∈ C.
Letting F = C(D) shows that B ∩ {x, y} ≠ Ø. Likewise, letting F = D shows that
B ∩ {x, y} ≠ {x, y}. So we can assume B ∩ {x, y} = {x}. Taking F = D shows
that y ∈ D. If G ∩ {x, y} = {y} for some G ∈ C, then y ∈ G ∩ D ∈ C, and G ∩
D ⊂ D strictly, so G ∩ D ⊂ C(D) and y ∈ C(D) ⊂ B, giving a contradiction. So
C ∪ {B} does not shatter {x, y}, and S(C ∪ {B}) = 1, contradicting 1-maximality
of C.

So, each set D \ C(D) for Ø ≠ D ∈ C is a singleton. Each singleton {x}
equals D \ C(D) for at most one D ∈ C by Proposition 4.20. By Proposition
4.14, X = ∪_{D∈C} D. For any x ∈ X take D_1 ∈ C with x ∈ D_1. Let D_{n+1} :=
C(D_n) for n = 1, 2, .... For some m, D_m = Ø, and {x} = D_j \ C(D_j) for
some j < m. So all singletons are of the form D \ C(D), D ∈ C. This gives a
1-1 correspondence between singletons and nonempty sets in D, so there are
exactly m such sets and |D| = m + 1.

Suppose in this paragraph (only) that C is a class of two or more sets
such that (4.5) holds with Ø replaced by {x, y}. Then the class of complements,
N := {X \ C : C ∈ C}, satisfies the original hypotheses of Theorem 4.17. If
C is 1-maximal, so is N by Proposition 4.13. So Theorem 4.17 and Propositions
4.18 through 4.22 apply to N, and so does Proposition 4.23 if X is finite. Then,
C itself has a “cotreelike” ordering, where for each C ∈ C, {D ∈ C: C ⊂ D}
is linearly ordered by inclusion. Propositions 4.18 and 4.19 apply to C if Ø
is replaced by X and intersections by unions; in Proposition 4.19, we will
have an immediate successor D(C) ⊃ C instead of a predecessor. We take sets
D(C) \ C instead of D \ C(D) in Propositions 4.20 and 4.23. The resulting
tree (Proposition 4.22) then branches out as sets become smaller rather than
larger.


Next will be several facts in the general case, i.e. without the hypothesis 
(4.5). 


Theorem 4.24 Let X be any set and C any collection of subsets with S(C) = 1.
Then for any C ∈ C, the collection C_{X\C} := {B \ C : B ∈ C} satisfies (4.5) for
any x ≠ y as a collection of subsets of X \ C. Likewise, C_{C\} := {C \ B: B ∈
C} satisfies (4.5) for any x ≠ y as a collection of subsets of C, S(C_{C\}) ≤ 1, and
S(C_{X\C}) ≤ 1.


Proof. Letting B = C shows that both classes C_{C\} and C_{X\C} contain Ø, so (4.5)
holds for them. Both have index S ≤ 1 by Propositions 4.12 and 4.13.


So, for an arbitrary class C with S(C) = 1, we have by Theorem 4.17 a
treelike inclusion partial ordering in one part X \ C of X and a cotreelike
ordering in the complementary part C, for any C ∈ C. If also X \ C happens to
be in C, both orderings are linear. To see how the two orderings fit together in
general, Proposition 4.13 gives:


Corollary 4.25 Let C be any class of sets with S(C) = 1 and A ∈ C. Let D :=
AΔΔC. Then S(D) = 1 and Ø ∈ D. If C is 1-maximal, so is D. Then Theorem
4.17, Proposition 4.18, and if C is finite, Propositions 4.19, 4.20, 4.22, and if
X is finite, 4.23, apply to D.


The last sentence in Proposition 4.23 has a converse and extension: 


Proposition 4.26 Let X be finite with m elements and C ⊂ 2^X with S(C) = 1.
Then C is 1-maximal if and only if |C| = m + 1.


Proof. For any fixed C ∈ C, we can replace C by CΔΔC without loss of
generality by Proposition 4.13. So we can assume Ø ∈ C, and then (4.5) holds for
all x, y. Then Proposition 4.23 implies “only if.”

Conversely, let S(C) = 1 and |C| = m + 1. Let C ⊂ D strictly. Then |D| >
m + 1, so by Sauer’s Lemma (Theorem 4.2), S(D) ≥ 2. So C is 1-maximal.

Now, m + 1 = _mC_{≤1}, which is the maximum value of m^C(m) for S(C) = 1
by Sauer’s Lemma (Theorem 4.2). The example at the end of Section 4.3
shows that Proposition 4.26 in the form |C| = _mC_{≤k}, k = 1, does not extend
to k-maximality for k > 1.
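
For a very small ground set, Proposition 4.26 can be confirmed by enumerating
every class of subsets. The sketch below is an added illustration; it takes the
hypothetical three-point set {0, 1, 2}, which has 2^8 = 256 classes, and the
helper names are assumptions of this sketch.

```python
from itertools import combinations

def vc_index(cls, X):
    """Brute-force S(cls) over the finite ground set X; -1 for the empty class."""
    cls = {frozenset(c) for c in cls}
    if not cls:
        return -1
    s = 0
    for k in range(1, len(X) + 1):
        shattered = any(
            len({c & frozenset(F) for c in cls}) == 2 ** k
            for F in combinations(X, k)
        )
        if not shattered:
            break
        s = k
    return s

def powerset(X):
    return [frozenset(s) for k in range(len(X) + 1) for s in combinations(X, k)]

def check_prop_4_26(X):
    """For every class with S = 1: it is 1-maximal iff it has |X| + 1 sets."""
    subsets = powerset(X)
    m = len(X)
    for r in range(len(subsets) + 1):
        for cls in combinations(subsets, r):
            if vc_index(cls, X) != 1:
                continue
            maximal = all(
                vc_index(cls + (A,), X) > 1 for A in subsets if A not in cls
            )
            if maximal != (len(cls) == m + 1):
                return False
    return True

holds_for_three_points = check_prop_4_26((0, 1, 2))
print(holds_for_three_points)
```

The enumeration grows doubly exponentially in |X|, so this is practical only
for three or perhaps four points.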

Next it will be shown that 1-maximality can be relativized to subsets. For a
set X, a subset Y ⊂ X, and a class C ⊂ 2^X, recall that C_Y := C ⊓ Y := {A ∩ Y :
A ∈ C}.


Theorem 4.27 If C is 1-maximal and Ø ≠ Y ⊂ X, then C_Y is a 1-maximal class
of subsets of Y.


Proof. Let A ∈ C. Without changing 1-maximality, C can be replaced by AΔΔC
(Proposition 4.13). So we can assume Ø ∈ C. We can also assume that |X| ≥ 2.


Case 1. Suppose X is finite, |X| = m < ∞. Then by Proposition 4.23 or 4.26,
|C| = m + 1. Let x ∈ X and Y := X \ {x}. Let B := {y ∈ Y : {x, y} ∈ C_{{x,y}}
and {y} ∈ C_{{x,y}}}. A set C ⊂ Y will be called full if C ∈ C and C ∪ {x} ∈ C. To
continue the proof of the theorem, we have the following:


Lemma 4.28 If A ⊂ Y and A is full, then A = B.


Proof. Suppose A is full. First suppose y ∈ A \ B. Then A ∪ {x} ∈ C implies
{x, y} ∈ C_{{x,y}} and A ∈ C implies {y} ∈ C_{{x,y}}, contradicting y ∉ B.

Next, suppose y ∈ B \ A. Then {x} = (A ∪ {x}) ∩ {x, y}, Ø = Ø ∩ {x, y},
and y ∈ B imply that C shatters {x, y}, a contradiction. So A = B, proving the
Lemma.


Continuing the proof of Theorem 4.27, any two distinct sets in C have
different intersections with Y, except possibly for B and B ∪ {x}. Now
|C| = m + 1 by Proposition 4.26, so |C_Y| ≥ m and |C_Y| = m since S(C_Y) ≤ 1
and |Y| = m − 1. So if m ≥ 2, C_Y is a 1-maximal class of subsets of Y by
Proposition 4.26 again. Then by induction downward, C_Y is 1-maximal in Y
for any Y ⊂ X, Y ≠ Ø, proving the Theorem if X is finite.


Case 2. Let X be general and Ø ≠ Y finite. Suppose C_Y is not 1-maximal in
Y. Then by Case 1, for any finite Z ⊃ Y, C_Z is not 1-maximal in Z. Let E
be the class of all B ⊂ Y such that B ∉ C_Y and S({B} ∪ C_Y) ≤ 1. Suppose for
each B ∈ E there is some finite Z(B) ⊃ Y such that there is no A ⊂ Z(B)
with S({A} ∪ C_{Z(B)}) ≤ 1 and A ∩ Y = B. Let Z := ∪_{B∈E} Z(B), a finite set
including Y. So C_Z is strictly included in some class D which is 1-maximal in
Z, by Theorem 4.11 (with A = 2^Z), and D_Y is 1-maximal by Case 1. So B ∈ D_Y
for some B ∈ E, say D ∩ Y = B for some D ∈ D. Then A := D ∩ Z(B) gives
a contradiction.

So, for some B ∈ E and any finite Z ⊃ Y, there is an A ⊂ Z such that
A ∩ Y = B and S(C_Z ∪ {A}) ≤ 1. Let D_Z be the class of all subsets D of X for
which D ∩ Z is such a set A. Then D_Z is compact in the product topology of
2^X (which was treated in Proposition 4.15). For any finite U ⊃ Y and V ⊃ Y,
D_U ∩ D_V ⊃ D_{U∪V} ≠ Ø. So the intersection of all D_U is nonempty. Let D ∈ D_U
for all finite U ⊃ Y. Then S(C ∪ {D}) = 1, so D ∈ C, but D ∩ Y = B ∉ C_Y, a
contradiction. This finishes the proof in Case 2.


Case 3. Let X be infinite and U any infinite subset. Let F(U) := {D ⊂ U :
D ∩ Y ∈ C_Y for all finite Y ⊂ U}. Then clearly C_U ⊂ F(U). Conversely, if
D ∈ F(U), then for each finite Z ⊂ X, let Y := Z ∩ U. Let G(Z) := {A ∈
C: A ∩ Y = D ∩ Y}. Then G(Z) is compact by Proposition 4.15, nonempty
since D ∈ F(U), and decreases as Z increases. Thus there is some C ∈ ∩_Z G(Z).
Then for any finite Z, C ∩ Z ∩ U = D ∩ Z ∩ U, so C ∩ U = D ∩ U = D. So
D ∈ C_U and C_U = F(U).

For each finite, nonempty Y ⊂ U, C_Y is 1-maximal in Y by Case 2. Suppose
F(U) is not 1-maximal in U. Let A ⊂ U, A ∉ F(U), S(E) ≤ 1 where
E = {A} ∪ F(U). For each finite Y ⊂ U, C_Y ⊂ E_Y so C_Y = E_Y. Let H_Y :=
{B ∈ C: B ∩ Y = A ∩ Y}. Then H_Y ≠ Ø, and as shown above, ∩_Y H_Y ≠ Ø.
Let B ∈ ∩_Y H_Y. Then A = B ∩ U ∈ C_U = F(U), a contradiction. So C_U is
1-maximal in U, and Theorem 4.27 is proved.


For any set X and C ⊂ 2^X, let x ≤_C y iff x = y or y ∈ ∪_{B∈C} B and for all
A ∈ C, y ∈ A implies x ∈ A. Then ≤_C is a quasi-order (as defined early in
this section) but in general not a partial order. The treelike partial orderings as
in Theorem 4.17 were on collections of sets. Now orderings will be defined
on X.


Theorem 4.29 If S(C) = 1, ∪_{B∈C} B = X, and C satisfies (4.5) for all x ≠ y,
then ≤_C is a sub-fully comparable quasi-order on X. If C is also 1-maximal,
then ≤_C is a partial order. Conversely, for any quasi-order ≤ on a set X, let
C := C_≤ := {A ⊂ X : A is fully comparably quasi-ordered by ≤ and x ∈ A
whenever x ≤ y ∈ A}. Then S(C) ≤ 1. If ≤ is a treelike partial order and
X ≠ Ø, then C is 1-maximal.


Proof. Suppose S(C) = 1 and (4.5) holds for all x ≠ y. Let x ≤_C z and y ≤_C z.
Suppose z ∈ B ∈ C. If x and y are not comparable for ≤_C, take C, D ∈ C with
x ∈ C \ D, y ∈ D \ C. Then C shatters {x, y}, a contradiction. So {x : x ≤_C z}
is fully comparably quasi-ordered by ≤_C and ≤_C is sub-fully comparable.

If C is 1-maximal, suppose x ≤_C y and y ≤_C x. If x ≠ y, the only subsets
of {x, y} induced by C are Ø and {x, y}, contradicting Theorem 4.27. So x = y
and ≤_C is a partial order.

Next, let ≤ be a quasi-order on X and C := C_≤. Take any x ≠ y. If x ≤ y,
then {y} ∉ C_{{x,y}} or if y ≤ x, then {x} ∉ C_{{x,y}}. If x and y are not comparable
for ≤, then {x, y} ∉ C_{{x,y}}. So S(C) ≤ 1.

If ≤ is a treelike partial order and X ≠ Ø, then Ø ∈ C. For any x ∈ X, L_x :=
{y : y ≤ x} ∈ C, so S(C) = 1.

Let A ⊂ X and suppose A is not fully comparably ordered by ≤. Take
x, y ∈ A which are not comparable for ≤. Since Ø, L_x, and L_y are all in
C, C ∪ {A} shatters {x, y}, and S(C ∪ {A}) = 2.

Let B ⊂ X and suppose there exist x ≤ y ∈ B with x ∉ B. Then L_x ∈
C, L_y ∈ C, and B ∩ {x, y} = {y} imply that S(C ∪ {B}) ≥ 2. This has now
been shown for any B ∉ C, so C is 1-maximal.
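
The converse construction in Theorem 4.29 can be carried out explicitly on a
small example. The sketch below is an added illustration: the treelike partial
order is a hypothetical forest on {0, ..., 4} encoded by a parent map `PARENT`
(so x ≤ y means x is y or an ancestor of y), and C_≤ is built as the class of
downward-closed chains. All names here are assumptions of this sketch.

```python
from itertools import combinations

def vc_index(cls, X):
    """Brute-force S(cls) over the finite ground set X; -1 for the empty class."""
    cls = {frozenset(c) for c in cls}
    if not cls:
        return -1
    s = 0
    for k in range(1, len(X) + 1):
        shattered = any(
            len({c & frozenset(F) for c in cls}) == 2 ** k
            for F in combinations(X, k)
        )
        if not shattered:
            break
        s = k
    return s

def powerset(X):
    return [frozenset(s) for k in range(len(X) + 1) for s in combinations(X, k)]

# Hypothetical treelike order: 0 below 1 and 2, 2 below 3, and 4 unrelated.
PARENT = {0: None, 1: 0, 2: 0, 3: 2, 4: None}
X = tuple(PARENT)

def leq(x, y):
    """x <= y iff x equals y or is an ancestor of y in the forest."""
    while y is not None:
        if y == x:
            return True
        y = PARENT[y]
    return False

def class_of_order(X):
    """C_<= : sets that are chains for <= and contain x whenever x <= y in the set."""
    out = []
    for A in powerset(X):
        chain = all(leq(x, y) or leq(y, x) for x in A for y in A)
        lower = all(x in A for y in A for x in X if leq(x, y))
        if chain and lower:
            out.append(A)
    return out

C = class_of_order(X)
index_of_C = vc_index(C, X)
one_maximal = all(vc_index(C + [A], X) > 1 for A in powerset(X) if A not in C)
print(sorted(map(sorted, C)), index_of_C, one_maximal)
```

The resulting class has |X| + 1 members, in agreement with Proposition 4.26,
and each nonempty member is a set L_z = {x : x ≤ z}, as Theorem 4.30 (II)
describes for finite X.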


Example. Let X = {1, 2, 3, 4, 5} and
G = {Ø, {1}, {5}, {2, 5}, {1, 2, 4}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}}.


Let C be the complement of G in 2^X. Then it can be checked that C is
3-maximal but for Y = {1, 2, 3, 4}, C_Y is not 3-maximal in Y. So in Theorem 4.27,
“1-maximal” and Y ≠ Ø cannot be replaced by “3-maximal” and “|Y| ≥ 3”
respectively.
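
This example can also be checked by exhaustive computation. The sketch below
is an added illustration using the sets G and Y exactly as listed above; the
helper names `vc_index`, `powerset`, and `is_maximal` are assumptions of this
sketch.

```python
from itertools import combinations

def vc_index(cls, X):
    """Brute-force S(cls) over the finite ground set X; -1 for the empty class."""
    cls = {frozenset(c) for c in cls}
    if not cls:
        return -1
    s = 0
    for k in range(1, len(X) + 1):
        shattered = any(
            len({c & frozenset(F) for c in cls}) == 2 ** k
            for F in combinations(X, k)
        )
        if not shattered:
            break
        s = k
    return s

def powerset(X):
    return [frozenset(s) for k in range(len(X) + 1) for s in combinations(X, k)]

def is_maximal(cls, X):
    """True if adjoining any subset of X not in cls strictly raises S."""
    cls = {frozenset(c) for c in cls}
    n = vc_index(cls, X)
    return all(vc_index(cls | {A}, X) > n for A in powerset(X) if A not in cls)

X = (1, 2, 3, 4, 5)
G = [frozenset(g) for g in (
    set(), {1}, {5}, {2, 5}, {1, 2, 4}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}
)]
C = [A for A in powerset(X) if A not in G]

Y = (1, 2, 3, 4)
C_Y = {A & frozenset(Y) for A in C}       # the relativized class C ⊓ Y

print(vc_index(C, X), is_maximal(C, X))
print(vc_index(C_Y, Y), is_maximal(C_Y, Y))
```

The first line of output concerns C on X and the second concerns C_Y on Y;
comparing the two shows the failure of relativization for index 3.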

Recall that a linearly ordered subset of a partially ordered set is called a 
chain. 


Theorem 4.30 Let C be a 1-maximal class of subsets of a set X containing Ø.


(I) Then B ∈ C if and only if both (a) B is a chain for ≤_C and (b) if x ≤_C y ∈ B,
then x ∈ B.

(II) If X is finite, B ∈ C if and only if B = Ø or for some z ∈ X, B = {x :
x ≤_C z}.


Proof. To prove “only if” in (I), (b) holds by definition of ≤_C. To prove (a),
suppose x, y ∈ B are not comparable for ≤_C. By Theorem 4.27 applied to
singletons Y, we have X = ∪_{C∈C} C. Thus for some D ∈ C, y ∈ D and x ∉ D,
and for some E ∈ C, x ∈ E and y ∉ E. So C ⊃ {Ø, D, E, B} shatters {x, y}, a
contradiction. Thus (a) holds.

Conversely, suppose (a) and (b) hold. Suppose C ∪ {B} shatters some {x, y}.
If x ≤_C y, then C ∩ {x, y} ≠ {y} for C ∈ C or C = B, a contradiction. So x
and y are not comparable for ≤_C. Then C ⊓ {x, y} contains Ø, {x} and {y},
and so not {x, y}. Also B ∩ {x, y} ≠ {x, y}, giving another contradiction. So
S(C ∪ {B}) = 1 and since C is 1-maximal, B ∈ C, proving “if.”

For (II), a B of the given form satisfies (b) clearly, and (a) holds because
≤_C is treelike by Theorem 4.29, so B ∈ C by part (I). Conversely, if B ∈ C it is
a chain for ≤_C by (I), so if it is nonempty, it has a largest element z, and then
B = {x : x ≤_C z} by (a) and (b).


4.5 *Combining VC Classes


Recalling the density as in Corollary 4.4, the following is clear:


Theorem 4.31 For any set X, if A ⊂ 2^X and C ⊂ 2^X, then
dens(A ∪ C) = max(dens(A), dens(C)).


For the Vapnik–Červonenkis index we have instead:


Proposition 4.32 For any set X, A ⊂ 2^X and C ⊂ 2^X, S(A ∪ C) ≤ S(A) +
S(C) + 1. This bound is best possible: for any nonnegative integers k and
m there exist X, A and C ⊂ 2^X with S(A) = k, S(C) = m and S(A ∪ C) =
k + m + 1.


Proof. By Theorem 4.2, if k := S(A), m := S(C), and n ≥ k + m + 2,

m^{A∪C}(n) ≤ m^A(n) + m^C(n) ≤ _nC_{≤k} + _nC_{≤m} = Σ_{j=0}^{k} \binom{n}{j} + Σ_{j=n−m}^{n} \binom{n}{j} < 2^n,

since k + 1 < n − m, so the two ranges of j are disjoint and omit j = k + 1. So S(A ∪ C) < n. It follows that S(A ∪ C) ≤ k + m + 1 as claimed. Conversely, given k and m, let n = k + m + 1 and let X be a set with n members. Let A be the set of all subsets of X with at most k members and C the set of subsets of X with at least n − m = k + 1 members. Then S(A) = k, S(C) = m and A ∪ C = 2^X, so S(A ∪ C) = n.
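The extremal construction in this proof can be checked by brute force on small sets. The following is a minimal Python sketch, not part of the original text: the helper names `subsets`, `vc_index` and `extremal_example` are hypothetical, and the value -1 returned for the empty class is a convention assumed here.

```python
from itertools import combinations

def subsets(ground, min_size=0, max_size=None):
    """All subsets of `ground` with min_size <= size <= max_size, as frozensets."""
    ground = list(ground)
    if max_size is None:
        max_size = len(ground)
    return {frozenset(c)
            for r in range(min_size, max_size + 1)
            for c in combinations(ground, r)}

def vc_index(cls, ground):
    """S(cls): size of the largest subset of `ground` shattered by `cls`.

    Returns -1 for the empty class (a convention assumed in this sketch).
    """
    ground = list(ground)
    best = -1
    for r in range(len(ground) + 1):
        shattered = any(
            len({A & frozenset(F) for A in cls}) == 2 ** r
            for F in combinations(ground, r)
        )
        if not shattered:
            break  # subsets of shattered sets are shattered, so we can stop
        best = r
    return best

def extremal_example(k, m):
    """Return (S(A), S(C), S(A u C)) for the classes used in the proof."""
    n = k + m + 1
    X = range(n)
    A = subsets(X, max_size=k)        # subsets with at most k members
    C = subsets(X, min_size=n - m)    # subsets with at least n - m = k + 1 members
    return vc_index(A, X), vc_index(C, X), vc_index(A | C, X)

print(extremal_example(2, 1))
```

The search is exponential in |X|, so it is only meant for sets of a handful of points.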


Let X be a set and C, D any two collections of subsets of X. Let
C ⊓ D := {C ∩ D : C ∈ C, D ∈ D}, C ⊔ D := {C ∪ D : C ∈ C, D ∈ D}.
If A is a class of subsets of another set Y, let
C ⊠ A := {C × A : C ∈ C, A ∈ A}.
For such classes, we have:


Theorem 4.33 For any C ⊂ 2^X and D ⊂ 2^X or 2^Y let k := dens(C) and m := dens(D). Then we have: dens(C □ D) ≤ k + m for □ = ⊓, ⊔ or ⊠.


Proof. For any of the three operations we have m^{C□D}(n) ≤ m^C(n) m^D(n), and the conclusions follow.


For the Vapnik-Červonenkis index the behavior of the □ operations is not so simple. For k, m = 0, 1, 2, ..., and □ = ⊔, ⊓ or ⊠ let □(k, m) := max{S(C □ D) : S(C) = k, S(D) = m}. Here the maximum is taken where X and Y may be infinite sets. Then we have:


Theorem 4.34 For any k = 0, 1, 2, ... and m = 0, 1, 2, ..., and □ = ⊔, ⊓ or ⊠, we have □(k, m) < ∞.


Proof. By Theorem 4.2 and Proposition 4.3, if S(C) = k and S(D) = m, then


m^{C□D}(n) ≤ m^C(n) m^D(n) ≤ 9n^{k+m}/(4 k! m!) < 2^n


for n ≥ n_0(k, m) large enough.


Theorem 4.35 For any k, m = 0, 1, 2, ..., ⊓(k, m) = ⊔(k, m) = ⊠(k, m).


Proof. The first equation follows from taking complements (Proposition 4.5). For the second, for given k and m by Theorem 4.34 it is enough to consider large enough finite sets in place of X and Y, and then we can assume X = Y. We have ⊓(k, m) ≤ ⊠(k, m) by restricting to the diagonal in X × X and applying Proposition 4.12.

In the other direction let Π_X and Π_Y be the projections of X × Y onto X and Y respectively. Let F := {Π_X^{−1}(C) : C ∈ C}, B := {Π_Y^{−1}(A) : A ∈ A}. By Theorem 4.7, S(F) = S(C) and S(B) = S(A). Since Π_X^{−1}(C) ∩ Π_Y^{−1}(A) = C × A it follows that S(F ⊓ B) ≥ S(C ⊠ A), and the Theorem is proved.


Let S(k, m) be the common value of the quantities in Theorem 4.35. Theorem 4.34 can be improved as follows. For any nonnegative integers j, k let θ(j, k) := sup{r ∈ N : (_rC_{≤j})(_rC_{≤k}) ≥ 2^r}. Then θ(j, k) < ∞ for each j, k by Proposition 4.3 and we have:


Proposition 4.36 S(j, k) ≤ θ(j, k) for any j, k ∈ N.


Proof. Let S(C) = j and S(D) = k. Then for any n > θ(j, k), by Sauer’s Lemma (Theorem 4.2), and Proposition 4.3 again,


m^{C□D}(n) ≤ m^C(n) m^D(n) ≤ (_nC_{≤j})(_nC_{≤k}) < 2^n.


This finishes the proof. 
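The bound θ(j, k) is easy to tabulate numerically. The sketch below is illustrative only; it reads the defining condition as (_rC_{≤j})(_rC_{≤k}) ≥ 2^r, and the names `binom_sum` and `theta` are hypothetical. The stopping rule relies on the elementary bound _rC_{≤j} ≤ (r + 1)^j, together with the fact that 2^r/(r + 1)^d is increasing in r once r ≥ 2d.

```python
from math import comb

def binom_sum(n, k):
    """_nC_{<=k}: the number of subsets of an n-set with at most k members."""
    return sum(comb(n, i) for i in range(k + 1))

def theta(j, k):
    """Largest r with binom_sum(r, j) * binom_sum(r, k) >= 2**r."""
    d = j + k
    best, r = 0, 0
    while True:
        if binom_sum(r, j) * binom_sum(r, k) >= 2 ** r:
            best = r
        elif r >= 2 * d and (r + 1) ** d < 2 ** r:
            # From here on (r + 1)**d < 2**r for every larger r, and the
            # product above is at most (r + 1)**d, so the search is complete.
            return best
        r += 1

for j, k in [(0, 0), (1, 1), (1, 2), (0, 3)]:
    print((j, k), theta(j, k))
```

For j = k = 1 this gives θ(1, 1) = 5, while S(1, 1) = 3 by Theorem 4.38 and Proposition 4.39, so the bound of Proposition 4.36 is not sharp in general.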


Can the values S(k, m) be computed? The next two theorems and proposition 
will give some information. 


Theorem 4.37 Let X be a set, C, D ⊂ 2^X, and C ⊔ D = 2^X. Let A ⊂ X and suppose for all B ∈ C, either B ⊂ A or B ⊂ A^c. Then D shatters either A or A^c.


Proof. Suppose D does not shatter A. Then take H ⊂ A such that D ∩ A ≠ H for all D ∈ D. Take any E ⊂ A^c. Then E ∪ H = C ∪ D for some C ∈ C and D ∈ D. If C ⊂ A^c then D ∩ A = H, a contradiction. So C ⊂ A and D ∩ A^c = E. Thus D shatters A^c.


For any set Y recall that |Y| denotes the number of elements of Y. Here is an upper bound for S(1, k) that will be shown to be exact for k = 1, 2, 3 in Proposition 4.39.


Theorem 4.38 For any k = 1, 2, ..., S(1, k) ≤ 2k + 1.


Proof. Suppose |X| = 2k + 2, C ⊔ D = 2^X, S(C) = 1, and S(D) = k. We can assume by Theorem 4.11 that C is 1-maximal. Then ∪_{B∈C} B = X by Theorem 4.27 applied to singletons Y. We have Ø ∈ C ∩ D. Thus by Theorem 4.17, C has a treelike partial ordering by inclusion, which induces such a treelike partial ordering on X by Theorem 4.29. Let Y be the set of elements of X having at least one predecessor for this ordering. Each y ∈ Y has a smallest predecessor f(y) ∉ Y. For each B ⊂ Y, we have B = C ∪ D, C ∈ C, D ∈ D, where C = Ø, B = D since if y ∈ C ∩ Y, f(y) ∈ C \ Y. So D shatters Y and Y has at most k elements.

Let r be the number of values of f, say t_1, ..., t_r. Then Y is decomposed into disjoint subsets Y_1, ..., Y_r such that f = t_j on Y_j for each j. Let C := (Y ∪ ran f)^c. Then |C| ≥ 2. Let n_j := |Y_j|, j = 1, ..., r. Then


2k + 2 = |C| + Σ_{j=1}^{r} (n_j + 1). (4.6)

It will be shown that there exist subsets E ⊂ C and I ⊂ {1, ..., r} such that

|E| + Σ_{j∈I} (n_j + 1) = k + 1. (4.7)


Let K be the largest possible value ≤ k + 1 of the left side of (4.7). Suppose K ≤ k. Then


K = |C| + Σ_{j∈J} (n_j + 1) (4.8)

for some J ⊂ {1, ..., r} since elements of C could be put into E one at a time. We then have by (4.6)

Σ_{j∉J} (n_j + 1) = 2k + 2 − K ≥ k + 2. (4.9)


Let n_0 be the smallest value of n_j for j ∉ J. Then n_0 ≥ |C| + 1, or another j could be put in I for a suitable E on the left side of (4.7), giving a larger K. Since each n_j ≤ k, there must be at least two j ∉ J by (4.9). Thus r − 2 + 2n_0 ≤ |Y| ≤ k, r ≤ k − 2|C|, and by (4.6),


2k + 2 − |C| = Σ_{j=1}^{r} (n_j + 1) ≤ 2k − 2|C|

and |C| ≤ −2, a contradiction, so K = k + 1 and (4.7) is proved.

Thus there is a set A ⊂ X with |A| = k + 1, A := E ∪ ∪_{j∈I} (Y_j ∪ {t_j}), with E and I from (4.7). Let B ∈ C. Then by Theorem 4.30(I)(a), either B ⊂ Y_j ∪ {t_j} for some j or B is a singleton. Thus either B ⊂ A or B ⊂ A^c. So Theorem 4.37 applies and S(D) ≥ k + 1, a contradiction.


For k = 1, 2, 3 we have, where the lower bound for k = 2 is due to L. Birgé, 


Proposition 4.39 S(1, k) = 2k + 1 for k = 1, 2, 3.


Proof. By Theorem 4.38 we need to show S(1, k) ≥ 2k + 1 for k = 1, 2, 3. Sets {a, b, ..., d} will be denoted ab···d, e.g. 1246 := {1, 2, 4, 6}. For k = 1 let X := 123, C := {Ø, 1, 2, 3}, and D := {Ø, 1, 2, 23}. Then clearly S(C) = 1, S(D) = 1, and S(C ⊔ D) = 3, so S(1, 1) ≥ 3.

For k = 2 let X := 12345, C := {Ø, 1, 2, 3, 4, 45},


D := {Ø, 1, 2, 3, 5, 12, 13, 15, 23, 25, 234, 235, 2345}.


Then one can check that S(C) = 1, S(D) = 2, and C ⊔ D = 2^X. So S(1, 2) ≥ 5.

To show that S(1, 3) = 7, take the set X := {0, 1, 2, 3, 4, 5, 6}. We will find classes C ⊂ 2^X and E ⊂ 2^X with S(C) = 1, S(E) = 3, and C ⊔ E = 2^X.

Let C := {Ø, 0, 1, 12, 3, 34, 5, 56}. Then C has a treelike partial ordering by inclusion and S(C) = 1.

A set with k elements is called a k-set. E will contain the following subsets of X: the 0-set Ø; all 1-sets; all 2-sets except 12 and 34; all 3-sets not including 12 or 34; all 4-sets included in 01234; and the 5-sets 01234 and 12346. Then E shatters some 3-sets, e.g., 246. To show S(E) = 3 we need to show E shatters no 4-set. E shatters no 4-set containing 5 since there is no set A in E with cardinality |A| ≥ 4 containing 5.

A 4-set B ⊂ 01234 includes at least one of the pairs 12 or 34. By symmetry, suppose 12 ⊂ B. Each set C in E including 12 contains at least two of 0, 3 and 4, so |C ∩ B| ≥ 3 and C ∩ B ≠ 12. Thus E does not shatter B. It remains to consider 4-sets containing 6 and not 5. There is no A ∈ E including 06 with |A| ≥ 4. Thus E does not shatter any 4-set including 06. The sets 1236 and 1246 are not shattered by E because the subset 126 is not cut from them. Likewise the sets 1346 and 2346 are not shattered because 346 is not cut from them. Thus S(E) = 3.

To show C ⊔ E = 2^X, clearly C ⊔ E contains all 0- and 1-sets, and it is easy to check that it contains all 2-sets and all 3-sets A not including 12 or 34. If A ⊃ 12, then A = 12 ∪ c where 12 ∈ C and c ∈ E, and likewise for 34.

C ⊔ E contains X = 56 ∪ 01234, 012345 = 5 ∪ 01234, and 012346 = 0 ∪ 12346. Each other 6-set is the union of 56 ∈ C and a 4-set in E included in 01234.

A 5-set containing 5 and not 6 is the union of 5 ∈ C and a 4-set ⊂ 01234. A 5-set F containing 6 and not 5 includes at least one of the pairs P_1 = 12 or P_2 = 34. If it includes both pairs, it is in E. If it includes just one pair P_j, we have


(*) F = A ∪ (F \ A), A ∈ C, F \ A ∈ E,


for A = P_j.
If a 5-set F ⊃ 56 includes a pair P_j, then (*) follows likewise. Otherwise it holds for A = 56. The remaining 5-set, 01234, is in E.

A 4-set ⊂ 01234 is in E. A 4-set F containing 5 or 6 or both includes at most one pair P_j. If it includes P_j, (*) holds for A = P_j. So suppose F includes neither pair P_j. If 56 ⊂ F, then (*) holds for A = 56. If 5 ∈ F and 6 ∉ F, then (*) holds for A = 5. The remaining case is 6 ∈ F and 5 ∉ F. At least one of a = 0, 1, or 3 is in F, and (*) holds for A = a. The proof of the case k = 3 of Proposition 4.39 is complete.
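The cases k = 1 and k = 2 of this proof are small enough to confirm by exhaustive search. The following Python sketch is only an illustration; the helpers `parse`, `vc_index`, `all_unions`, `power_set` and `check` are hypothetical names, and `parse` reads the shorthand of the proof, in which 45 stands for {4, 5}.

```python
from itertools import combinations

def parse(words):
    """Read the proof's shorthand: '45' means {4, 5}, '' means the empty set."""
    return {frozenset(int(ch) for ch in w) for w in words}

def vc_index(cls, ground):
    """Size of the largest subset of `ground` shattered by `cls` (brute force)."""
    ground = list(ground)
    best = -1
    for r in range(len(ground) + 1):
        if not any(len({A & frozenset(F) for A in cls}) == 2 ** r
                   for F in combinations(ground, r)):
            break
        best = r
    return best

def all_unions(C, D):
    """The class of all unions of a set in C with a set in D."""
    return {c | d for c in C for d in D}

def power_set(ground):
    ground = list(ground)
    return {frozenset(c) for r in range(len(ground) + 1)
            for c in combinations(ground, r)}

def check(X, C, D):
    """Return (S(C), S(D), whether the unions give every subset of X)."""
    return vc_index(C, X), vc_index(D, X), all_unions(C, D) == power_set(X)

X1 = [1, 2, 3]
C1 = parse(["", "1", "2", "3"])
D1 = parse(["", "1", "2", "23"])

X2 = [1, 2, 3, 4, 5]
C2 = parse(["", "1", "2", "3", "4", "45"])
D2 = parse(["", "1", "2", "3", "5", "12", "13", "15",
            "23", "25", "234", "235", "2345"])

print(check(X1, C1, D1), check(X2, C2, D2))
```

Since the unions fill out all of 2^X in both cases, the class of unions shatters X itself, which gives the lower bounds 3 and 5.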


For classes satisfying stronger conditions, more is true: 


Theorem 4.40 Let X and Y be sets, C ⊂ 2^X and D ⊂ 2^Y. If C is linearly ordered by inclusion, then S(C ⊠ D) ≤ S(D) + 1.


Proof. Let m = S(D). We can assume that m < ∞ and X ∈ C. Let F ⊂ X × Y with |F| = m + 2. Suppose C ⊠ D shatters F. The subsets of Π_X F ⊂ X induced by C are linearly ordered by inclusion. Let G be the next largest other than Π_X F itself and p ∈ Π_X F \ G. Take y such that (p, y) ∈ F. Then all subsets of F containing (p, y) are induced by sets of the form Π_X F × D, D ∈ D. Thus H := F \ {(p, y)} is shattered by such sets. Now |H| = m + 1, and Π_Y must be one-to-one on H or it could not be shattered by sets of the given form. But then D shatters Π_Y H, giving a contradiction.


Theorem 4.41 For any set X and C, D ⊂ 2^X, if C is linearly ordered by inclusion, then S(C □ D) ≤ S(D) + 1 for □ = ⊓ or ⊔.


Proof. First, consider □ = ⊓. Again let m = S(D), and we can assume m < ∞. Let F ⊂ X and |F| = m + 2. Suppose C ⊓ D shatters F. Now C_F is linearly ordered by inclusion. We can assume that X ∈ C, so F ∈ C_F. Let G be the next largest element of C_F and p ∈ F \ G. Each set A ⊂ F containing p is of the form C ∩ D ∩ F, C ∈ C, D ∈ D, so we must have C ∩ F = F, and A = D ∩ F. So D shatters F \ {p}, a contradiction.

Since the complements of a class linearly ordered by inclusion are again 
linearly ordered by inclusion, the case of unions follows by taking complements 
(Proposition 4.5). 
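As a sanity check, the bound of Theorem 4.41 can be tested on random small examples. This is a sketch under stated assumptions: the ground set has five points, the chain is built from a random ordering of the points, and the names `vc_index` and `chain_bound_holds` are hypothetical.

```python
import random
from itertools import combinations

def vc_index(cls, ground):
    """Size of the largest subset of `ground` shattered by `cls` (brute force)."""
    ground = list(ground)
    best = -1
    for r in range(len(ground) + 1):
        if not any(len({A & frozenset(F) for A in cls}) == 2 ** r
                   for F in combinations(ground, r)):
            break
        best = r
    return best

def chain_bound_holds(n=5, trials=200, seed=0):
    """Test S(C meet D) <= S(D) + 1 and S(C join D) <= S(D) + 1 for random chains C."""
    rng = random.Random(seed)
    X = list(range(n))
    all_subsets = [frozenset(c) for r in range(n + 1)
                   for c in combinations(X, r)]
    for _ in range(trials):
        order = X[:]
        rng.shuffle(order)
        # Initial segments of a linear order form a chain for inclusion.
        chain = {frozenset(order[:i]) for i in range(n + 1)
                 if rng.random() < 0.7} or {frozenset()}
        D = set(rng.sample(all_subsets, rng.randint(1, 10)))
        s = vc_index(D, X)
        meet = {c & d for c in chain for d in D}
        join = {c | d for c in chain for d in D}
        if vc_index(meet, X) > s + 1 or vc_index(join, X) > s + 1:
            return False
    return True

print(chain_bound_holds())
```

A passing run is of course no proof; it only confirms that the inequality is not violated on the sampled classes.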


Then by Theorem 4.10 and induction we have: 


Corollary 4.42 Let C_i be classes of subsets of a set X and C := {∩_{i=1}^{n} C_i : C_i ∈ C_i, i = 1, ..., n}, where each class C_i is linearly ordered by inclusion. Then S(C) ≤ n.


Definition. For any set X and Vapnik-Červonenkis class C ⊂ 2^X, C will be called bordered if for some F ⊂ X, with |F| = S(C), and x ∈ X \ F, F is shattered by sets in C all containing x.


Theorem 4.43 Let C_i ⊂ 2^{X(i)} be bordered Vapnik-Červonenkis classes for i = 1, 2. Then S(C_1 ⊠ C_2) ≥ S(C_1) + S(C_2).


Proof. Take F_i ⊂ X(i) and x_i as in the definition of “bordered.” Let H := ({x_1} × F_2) ∪ (F_1 × {x_2}). Then |H| = S(C_1) + S(C_2). For any V_i ⊂ F_i, i = 1, 2, take C_i ∈ C_i with C_i ∩ F_i = V_i and x_i ∈ C_i. Then (C_1 × C_2) ∩ H = ({x_1} × V_2) ∪ (V_1 × {x_2}), so C_1 ⊠ C_2 shatters H.


Theorem 4.43 extends by induction to any number of factors. One conse- 
quence is: 


Corollary 4.44 In R let J be the set of all intervals, which may be open or closed, bounded or unbounded on each side. In other words J is the set of all convex subsets of R. In R^m let C be the collection of all rectangles parallel to the axes, C := {Π_{i=1}^{m} J_i : J_i ∈ J, i = 1, ..., m}. Then S(C) = 2m. Let D be the set of all left half-lines (−∞, x] or (−∞, x) for x ∈ R. Let T := {Π_{i=1}^{m} H_i : H_i ∈ D, i = 1, ..., m}, so T is the class of lower orthants parallel to the given axes. Then S(T) = m.


Proof. The class D is linearly ordered by inclusion and is bordered with S(D) = 1. So is the class of half-lines [a, ∞) or (a, ∞), a ∈ R. The class J of all intervals in R is bordered with S(J) = 2. The results now follow from Corollary 4.42 with n = 2m and n = m, and Theorem 4.43 and induction.
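For finite point sets, shattering by rectangles or by lower orthants can be tested directly, which gives a concrete check of the two values in Corollary 4.44. The sketch below is illustrative; the function names are hypothetical. It uses the observation that a nonempty subset V of a finite set is cut out by some rectangle exactly when the bounding box of V contains no other point of the set, and similarly for orthants with only the upper corner.

```python
from itertools import combinations, product

def shattered_by_boxes(points):
    """True if products of intervals shatter the finite set `points` in R^m."""
    points = list(points)
    dim = len(points[0])
    for r in range(1, len(points) + 1):
        for V in combinations(points, r):
            lo = [min(p[i] for p in V) for i in range(dim)]
            hi = [max(p[i] for p in V) for i in range(dim)]
            # V is cut out iff its bounding box holds no further point.
            if any(q not in V and
                   all(lo[i] <= q[i] <= hi[i] for i in range(dim))
                   for q in points):
                return False
    return True  # the empty subset is always cut out by a far-away box

def shattered_by_lower_orthants(points):
    """Same test for lower orthants: only the upper corner matters."""
    points = list(points)
    dim = len(points[0])
    for r in range(1, len(points) + 1):
        for V in combinations(points, r):
            hi = [max(p[i] for p in V) for i in range(dim)]
            if any(q not in V and all(q[i] <= hi[i] for i in range(dim))
                   for q in points):
                return False
    return True

def cross(m):
    """The 2m points +e_i and -e_i in R^m."""
    return [tuple(sign if j == i else 0 for j in range(m))
            for i in range(m) for sign in (1, -1)]

grid = list(product(range(3), repeat=2))
print(shattered_by_boxes(cross(2)), shattered_by_boxes(cross(3)))
print(any(shattered_by_boxes(F) for F in combinations(grid, 5)))
```

The points ±e_i give a shattered set of size 2m for rectangles, and the unit vectors e_1, ..., e_m give one of size m for lower orthants.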


Proposition 4.45 Let J be the set of all intervals in R. Let Y be any set and C ⊂ 2^Y with Y ∈ C. Then in R × Y, S(J ⊠ C) ≤ 2 + S(C).


Proof. If S(C) = +∞ the result is clear, so suppose m := S(C) < ∞. Let F ⊂ R × Y with |F| = 3 + m and suppose J ⊠ C shatters F. Let (x_i, y_i), i = 1, ..., m + 3, be the points of F. Let u := min_i x_i and v := max_i x_i. Let p := (u, y_i) ∈ F and q := (v, y_j) ∈ F. All subsets of F which include {p, q} must be induced by sets of the form R × C, C ∈ C. So Π_Y must be one-to-one on F \ {p, q}, and C shatters Π_Y(F \ {p, q}) of cardinality m + 1, a contradiction.


Next is a necessary condition for a class C to be of index 1. Recall that a chain of sets is a class of sets linearly ordered by inclusion. For any class D of sets let D′ := {A^c : A ∈ D}.


Theorem 4.46 (A. Smoktunowicz) In a set X, let C ⊂ 2^X and S(C) = 1.
(i) If Ø ∈ C then for some chains A and B, C ⊂ A ⊓ B.
(ii) In general, for some chains A_i, i = 1, 2, 3, 4,

C ⊂ (A_1 ⊓ A_2) ⊔ (A_3 ⊓ A_4)′.


Proof. For (ii), for any A ∈ C, C ⊓ A^c and C′ ⊓ A are VC classes of index 1, containing Ø, and


C ⊂ (C ⊓ A^c) ⊔ ((C′ ⊓ A)′ ⊓ A).

Assuming (i), C′ ⊓ A ⊂ B_3 ⊓ B_4 for some chains B_3, B_4 of subsets of A. Letting A_j := B_j ⊔ A^c, j = 3, 4, we have (C′ ⊓ A)′ ⊓ A ⊂ (A_3 ⊓ A_4)′. So (i) implies (ii).

To prove (i), by Theorem 4.11 we can assume C is 1-maximal. Thus since Ø ∈ C, C has a treelike partial ordering by inclusion by Theorem 4.17. First suppose X is finite. By Theorem 4.29, take the treelike partial ordering of X induced by C.

Any chain is included in a maximal chain (for inclusion), and in a finite set of n elements, a maximal chain is of the form


{Ø, {a_1}, {a_1, a_2}, ..., {a_1, ..., a_n}}


and thus is equivalent to defining a linear ordering of the set, a_1 < a_2 < ··· < a_n. To define our two chains A, B we will thus define two linear orderings <_A, <_B of X. This will be done recursively as follows. Take the elements of X having no predecessors (there must be at least one) and call them a_1, ..., a_k for some choice of indices. Let a_1 <_A a_2 <_A ··· <_A a_k, a_k <_B a_{k−1} <_B ··· <_B a_1.

Next, suppose a_j has immediate successors a_{j1}, ..., a_{jr}. Let a_j <_A a_{j1} <_A ··· <_A a_{jr} <_A a_{j+1}, where “<_A a_{j+1}” is omitted if j = k. Also let a_j <_B a_{jr} <_B a_{j,r−1} <_B ··· <_B a_{j1} <_B a_{j−1}, where “<_B a_{j−1}” is omitted if j = 1. Iterating such definitions we get two linear orderings of X, each defining a chain of sets as above, so we get chains A, B.

By Theorem 4.30, every element of C is a set of the form


C := {a_{j_1}, a_{j_1 j_2}, ..., a^{(m)} := a_{j_1 j_2 ··· j_m}}


where a_{j_1} <_C a_{j_1 j_2} <_C ··· <_C a^{(m)} and “<_C” can be replaced by either <_A or <_B. It can be checked easily that


C = {x ∈ X : x ≤_A a^{(m)}} ∩ {x ∈ X : x ≤_B a^{(m)}}.


Thus C is the intersection of a set in A and a set in B, and the finite case is done.

Now suppose X is infinite, Ø ∈ C ⊂ 2^X and S(C) = 1. Let H := {J ⊂ X : J ≠ Ø, |J| < ∞}. For each J ∈ H take chains P_{Ji}, i = 1, 2, such that C ⊓ J ⊂ P_{J1} ⊓ P_{J2}. Let h be a non-point ultrafilter of subsets of H, in other words:


(1) h is a nonempty collection of nonempty subsets of H.
(2) If A, B ∈ h then A ∩ B ∈ h.

(3) If A ∈ h and A ⊂ B ⊂ H, then B ∈ h.

(4) For all A ⊂ H, either A ∈ h or A^c ∈ h.

(5) For each J ∈ H, {J} ∉ h.

The first three conditions make h a filter, the fourth an ultrafilter, and the fifth a non-point ultrafilter. Non-point ultrafilters exist by the axiom of choice (RAP, Theorem 2.2.4 and the statement after its proof).

Given any indexed family of nonempty sets {A_J}_{J∈H}, Π_{J∈H} A_J is the set of all {a_J}_{J∈H} such that a_J ∈ A_J for all J ∈ H. For {a_J}_{J∈H}, {b_J}_{J∈H} in Π_{J∈H} A_J, let {a_J}_{J∈H} ≡_h {b_J}_{J∈H} if and only if for some A ∈ h, a_J = b_J for all J ∈ A. Let lim_h A_J be the set of equivalence classes of members of Π_{J∈H} A_J for the relation ≡_h. Let {a_J}_h be the equivalence class to which {a_J}_{J∈H} belongs.

If B_J is a class of subsets of A_J for each J ∈ H, then for each element Z of lim_h B_J, where Z = {B_J}_h for some sets B_J in the classes B_J for each J, define a set E_Z ⊂ lim_h A_J by {a_J}_h ∈ E_Z if and only if for some A ∈ h, a_J ∈ B_J for all J ∈ A. Let (lim)_h B_J := {E_Z : Z ∈ lim_h B_J}.

Let X̄ := lim_h J, C̄ := (lim)_h C ⊓ J and for i = 1, 2 let P_i := (lim)_h P_{Ji}. To see that each P_i is a chain of subsets of X̄, we can take i = 1. Let W, Z ∈ lim_h P_{J1}, {B_J}_{J∈H} ∈ W, {C_J}_{J∈H} ∈ Z. Let 𝒥 := {J ∈ H : B_J ⊂ C_J}. Then either 𝒥 ∈ h or 𝒥^c ∈ h. If 𝒥 ∈ h then clearly E_W ⊂ E_Z. Otherwise, 𝒥^c ∈ h, and since P_{J1} is a chain for each J, C_J ⊂ B_J for all J ∈ 𝒥^c and E_Z ⊂ E_W.

It is easy to check that C̄ ⊂ P_1 ⊓ P_2. There is a natural 1-1 map i of X into X̄ by {a_J}_{J∈H} ∈ i(x) if and only if for some A ∈ h, a_J = x for all J ∈ A. So we can view X as a subset of X̄. Each P_i ⊓ X is a chain of subsets of X, and


C ⊂ C̄ ⊓ X ⊂ (P_1 ⊓ X) ⊓ (P_2 ⊓ X),


completing the proof.


Section 4.4 describes the structure of classes C with S(C) = 1, but the structure of VC classes with S(C) = k for k > 1 apparently is not known in general. Smoktunowicz (1997) showed that the class L of lines in the plane, a VC class with S(L) = 2, cannot be obtained from finitely many VC classes of index 1 and finitely many applications of the operations ⊓, ⊔, and taking complements. Theorem 4.46 reduces the proof from “VC classes of index 1” to “chains.”


4.6 Probability Laws and Independence 


Let (X, A, P) be a probability space. Recall the pseudo-metric d_P(A, B) := P(A Δ B) on A, where A Δ B := (A \ B) ∪ (B \ A). Recall also that for any (pseudo-) metric space (S, d) and ε > 0, D(ε, S, d) denotes the maximum number of points more than ε apart (Section 1.2).


Definition. For a measurable space (X, A) (a set X and a σ-algebra A of subsets of X) and C ⊂ A let s(C) := inf{w : there is a K = K(w, C) < ∞ such that for every law P on A and 0 < ε ≤ 1, D(ε, C, d_P) ≤ K ε^{−w}}.
This index s(C) turns out to equal the density:

Theorem 4.47 For any measurable space (X, A) and C ⊂ A, dens(C) = s(C).
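To make the packing numbers D(ε, C, d_P) in the definition of s(C) concrete, here is a small Python sketch for one very simple case chosen purely for illustration: P is uniform on N points 0, ..., N − 1 and C consists of the initial segments {0, ..., i − 1}, so that d_P between two segments is |i − j|/N. The function name is hypothetical. In this one-dimensional situation a greedy left-to-right choice attains the maximum number of sets more than ε apart.

```python
from fractions import Fraction

def packing_number_initial_segments(N, eps):
    """D(eps, C, d_P) for C = {{0, ..., i-1} : 0 <= i <= N}, P uniform on N points."""
    count, last = 1, 0            # start with the empty segment (i = 0)
    for i in range(1, N + 1):
        # d_P between segments i and `last` is (i - last) / N
        if Fraction(i - last, N) > eps:
            count += 1
            last = i
    return count

for eps in (Fraction(1, 2), Fraction(1, 10), Fraction(1, 100)):
    d = packing_number_initial_segments(100, eps)
    print(eps, d, d <= 2 / eps)
```

Here the packing numbers stay below 2/ε, in line with the fact that a class linearly ordered by inclusion has density at most 1.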


Proof. Let P be a probability measure on A. Suppose A_1, ..., A_m ∈ C and d_P(A_i, A_j) > ε > 0 for 1 ≤ i < j ≤ m, with m ≥ 2. Let X_1, X_2, ... be i.i.d. (P), specifically coordinates on a countable product of copies of (X, A, P). Then for n = 1, 2, ...,


Pr{for some i ≠ j, X_k ∉ A_i Δ A_j for all k ≤ n} ≤ m(m − 1)(1 − ε)^n/2 < 1


for n large enough, n > −log(m(m − 1)/2)/log(1 − ε). Let P_n := (1/n) Σ_{j=1}^{n} δ_{X_j} (as usual) be empirical measures for P. For such n, there is positive probability that P_n(A_i Δ A_j) > 0 for all i ≠ j, and so A_i and A_j induce different subsets of {X_1, ..., X_n} and m^C(n) ≥ m. For any r > dens(C) there is an M = M(r, C) < ∞, where we can take M ≥ 2, such that m^C(n) ≤ M n^r for all n. Note that −log(1 − ε) ≥ ε. Thus for m ≥ 2, m ≤ M(2 log(m²))^r ε^{−r}, or m(log m)^{−r} ≤ M_1 ε^{−r} for some M_1 = M_1(r, C) = 4^r M. For any δ > 0 and C large enough (log m)^r ≤ C m^δ for all m ≥ 1, so for all m > 0, m^{1−δ} ≤ M_2 ε^{−r} for 0 < ε ≤ 1 for some M_2 = M_2(r, C, δ). Thus m ≤ (M_2 ε^{−r})^{1/(1−δ)}. Letting r ↓ dens(C) and δ ↓ 0 gives dens(C) ≥ s(C).

In the converse direction, let |A| := card(A). Since it is not the case that m^C(n) ≤ k n^t for all n ≥ 1, for r < t < dens(C) and k = 1, 2, ..., let A_k ⊂ X with A_k ≠ Ø and |A_k ⊓ C| > k|A_k|^t. Then |A_k| → ∞ as k → ∞. Let B_0 := A_1. Other sets B_j will be defined recursively. Let B_0, ..., B_{n−1} be disjoint subsets of X and let C(n) := ∪_{0≤j<n} B_j. Let


k(n) := 2^{2^n + |C(n)|}.


Let B_n := A_{k(n)} \ C(n). So all the B_n are defined and disjoint. Each set in B_n ⊓ C is induced by at most 2^{|C(n)|} different sets in A_{k(n)} ⊓ C. Thus


|B_n ⊓ C| ≥ 2^{−|C(n)|} |A_{k(n)} ⊓ C| > 2^{2^n} |A_{k(n)}|^t ≥ 2^{2^n} |B_n|^t.

Since B_n is nonempty, it follows that

|B_n ⊓ C| ≥ 2^{2^n}, and hence |B_n| ≥ 2^n.

Let

α_n := |B_n|^{−t/r}, S := Σ_{n=0}^{∞} |B_n|^{1−t/r} < ∞.


Let P be the probability measure on ∪_n B_n ⊂ X giving mass α_n/S to each point of B_n for each n. The distinct sets in B_n ⊓ C are at d_P-distance at least α_n/S apart, and so are a set of elements of C which induce the subsets of B_n. So for all n,


D(α_n/(2S), C, d_P) ≥ |B_n ⊓ C| ≥ 2^{2^n} |B_n|^t = 2^{2^n} α_n^{−r} = 2^{2^n} (2S)^{−r} (α_n/(2S))^{−r}.

For ε := α_n/(2S) ↓ 0 as n → ∞, this implies

D(ε, C, d_P) ≥ 2^{2^n} (2S)^{−r} ε^{−r}.


Since 2^{2^n} → ∞ as ε ↓ 0 this implies r ≤ s(C). Then letting r ↑ dens(C), the proof is complete.


Corollary 4.48 For any probability space (X, A, P) and VC class C ⊂ A, the class F = {1_A : A ∈ C} is pregaussian for P.


Proof. Take any r > S(C). By Corollary 4.4 and Theorem 4.47, there is a constant K < +∞ such that D(ε, C, d_P) ≤ K ε^{−r} for 0 < ε ≤ 1. For f and g in L²(P) and ρ_P as defined in Section 3.1 we have ρ_P(f, g) ≤ e_P(f, g) := (∫(f − g)² dP)^{1/2}. For A, B ∈ A, we have e_P(1_A, 1_B) = d_P(A, B)^{1/2}. It follows that for 0 < ε ≤ 1,


D(ε, F, ρ_P) ≤ D(ε, F, e_P) ≤ D(ε², C, d_P) ≤ K ε^{−2r}. (4.10)


As the set F is bounded with respect to e_P, there is an M < ∞ such that for ε > M, D(ε, F, ρ_P) = D(ε, F, e_P) = 1. Thus the integral in the metric entropy sufficient condition for the GC property, Theorem 2.36, is finite, using the last sentence over M < ε < ∞ where log D(ε) = 0, and (4.10) for 0 < ε ≤ M where √(log(1/ε)) is integrable. So F is pregaussian.


To get the Donsker property, however, will require additional measurability 
assumptions (Chapters 5 and 6). 

There is a notion of independence for sets without probability. To define it, for any set X and subset A ⊂ X let A^1 := A and A^{−1} := X \ A. Sets A_1, ..., A_m are called independent, or independent as sets, if for every function s(·) from {1, ..., m} into {−1, +1}, ∩_{j=1}^{m} A_j^{s(j)} ≠ Ø. Such intersections, when they are nonempty, are called atoms of the Boolean algebra generated by A_1, ..., A_m. Thus for A_1, ..., A_m to be independent as sets means that the Boolean algebra they generate has the maximum possible number, 2^m, of atoms.

If A_1, ..., A_m are independent as sets, then one can define a probability law on the algebra they generate for which they are jointly independent in the usual probability sense and for which P(A_i) = 1/2, i = 1, ..., m. For example, choose a point in each atom and put mass 1/2^m at each point chosen. Or, if desired, given any q_i, 0 ≤ q_i ≤ 1, one can define a probability measure Q for which the A_i are jointly independent and have Q(A_i) = q_i, i = 1, ..., m.

For a set X and C ⊂ 2^X let


I(C) := sup{m : A_1, ..., A_m are independent as sets for some A_i ∈ C}.

Theorem 4.49 For any set X, C ⊂ 2^X, and n = 1, 2, ..., if S(C) ≥ 2^n, then I(C) ≥ n. Conversely if I(C) ≥ 2^n, then S(C) ≥ n. So I(C) < ∞ if and only if S(C) < ∞. In both cases, 2^n cannot be replaced by 2^n − 1.


Proof. Clearly, if a set Y has n or more independent subsets, then |Y| ≥ 2^n. Conversely if |Y| = 2^n, we can assume that Y is the set of all strings of n digits each equal to 0 or 1. Let A_j be the set of strings for which the jth digit is 1. Then the A_j are independent. It follows that if S(C) ≥ 2^n, then I(C) ≥ n, while if |Y| = 2^n − 1 and C = 2^Y, then S(C) = 2^n − 1, while I(C) < n as stated.

Conversely if B_j are independent as sets for j = 1, ..., 2^n, B_j ∈ C, let A(i) := A_i be independent subsets of {1, ..., 2^n} for i = 1, ..., n. Choose x_i ∈ ∩_{j∈A(i)} B_j ∩ ∩_{j∉A(i)} (X \ B_j). Then x_i ∈ B_j if and only if j ∈ A(i). For each set S ⊂ {1, ..., n},


∩_{i∈S} A_i ∩ ∩_{i∉S} ({1, ..., 2^n} \ A_i) = {j}

for some j := j_S ∈ {1, ..., 2^n}. Then j ∈ A_i if and only if i ∈ S, and
B_j ∩ {x_1, ..., x_n} = {x_i : i ∈ S}.


So C shatters {x_1, ..., x_n} and S(C) ≥ n, as stated. If C consists of 2^n − 1 (independent) sets, then clearly S(C) < n.
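The binary-string construction used in this proof, and the remark before Theorem 4.49 about putting equal mass on one point per atom, can be illustrated as follows. This is a minimal Python sketch; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

def atoms(sets, ground):
    """Nonempty atoms of the Boolean algebra generated by `sets` within `ground`."""
    groups = {}
    for x in ground:
        signature = tuple(x in A for A in sets)   # which sets contain x
        groups.setdefault(signature, []).append(x)
    return list(groups.values())

def independent_as_sets(sets, ground):
    """True if all 2^m intersections of the sets or their complements are nonempty."""
    return len(atoms(sets, ground)) == 2 ** len(sets)

def bit_sets(n):
    """Y = all 0-1 strings of length n, A_j = strings whose j-th digit is 1."""
    Y = list(product((0, 1), repeat=n))
    A = [frozenset(y for y in Y if y[j] == 1) for j in range(n)]
    return Y, A

def jointly_independent_under_uniform(sets, ground):
    """Product rule for every subfamily, with P uniform on `ground`."""
    N = len(ground)
    for r in range(1, len(sets) + 1):
        for S in combinations(range(len(sets)), r):
            inter = [x for x in ground if all(x in sets[i] for i in S)]
            expected = Fraction(1)
            for i in S:
                expected *= Fraction(len(sets[i]), N)
            if Fraction(len(inter), N) != expected:
                return False
    return True

Y, A = bit_sets(3)
print(independent_as_sets(A, Y), jointly_independent_under_uniform(A, Y))
print(independent_as_sets(A, Y[:-1]))   # one point removed: only 7 atoms remain
```

With 2^n points each atom is a single string, so removing any one point destroys independence as sets, matching the statement that 2^n cannot be replaced by 2^n − 1.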


For any set X, C ⊂ 2^X and Y ⊂ X, recall that C_Y := Y ⊓ C := {Y ∩ C : C ∈ C}. Let At(C|Y) be the set of atoms of the algebra of subsets of Y generated by C_Y, where in the cases to be considered, C_Y will be finite because C or Y is. Let Δ_C(Y) := |At(C|Y)| be the number of such atoms. Let m_*^C(n) := sup{Δ_A(X) : A ⊂ C, |A| ≤ n} ≤ 2^n. Let


dens*(C) := inf{s > 0 : for some C < ∞, m_*^C(n) ≤ C n^s for all n}.

For any x ∈ X let C_x := {A ∈ C : x ∈ A}. Let C^Y := {C_y : y ∈ Y}.

Theorem 4.50 For any set X and A ⊂ C ⊂ 2^X, with A finite,
(a) Δ_A(X) = Δ^{C^X}(A).
(b) For n = 1, 2, ..., m_*^C(n) = m^{C^X}(n).
(c) S(C^X) = I(C).
(d) dens*(C) = dens(C^X) ≤ I(C).
Proof. For any B ⊂ A let

a(B) := ∩_{B∈B} B ∩ ∩_{A∈A\B} (X \ A).


Then a(B) is an atom, in At(A|X), if and only if it is nonempty. Now y ∈ a(B) if and only if A ∩ C_y = B, so (a) follows. Then taking the maximum over A with |A| = n on both sides of (a) gives (b), which then implies (c) and (d). The last inequality follows from Corollary 4.4.
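Parts (a) and (c) of Theorem 4.50 can be illustrated on a small random class. In the sketch below, which is not from the text, a point x determines the subfamily of sets that contain it; counting distinct subfamilies counts both the atoms of the algebra generated by the family and the subfamilies cut out by the dual class {C_x}. All function names are hypothetical.

```python
import random
from itertools import combinations

def atoms_count(family, X):
    """Number of atoms in X of the algebra generated by the finite `family`."""
    return len({tuple(x in A for A in family) for x in X})

def dual_traces_count(family, X):
    """Number of subfamilies of `family` of the form {A in family : x in A}, x in X."""
    return len({frozenset(i for i, A in enumerate(family) if x in A) for x in X})

def largest_r(cls, X, count):
    """Largest r such that some r-element subfamily attains the maximal count 2^r."""
    best = 0
    for r in range(1, len(cls) + 1):
        if any(count(sub, X) == 2 ** r for sub in combinations(cls, r)):
            best = r
        else:
            break   # both properties pass to subfamilies
    return best

def random_class(n_points=6, n_sets=6, seed=0):
    rng = random.Random(seed)
    X = list(range(n_points))
    cls = {frozenset(x for x in X if rng.random() < 0.5) for _ in range(n_sets)}
    return X, sorted(cls, key=sorted)

X, cls = random_class()
independence_index = largest_r(cls, X, atoms_count)      # I(C)
dual_vc_index = largest_r(cls, X, dual_traces_count)     # S(C^X)
print(independence_index, dual_vc_index)
```

The two indices agree for every class, since the two counting functions coincide on every finite subfamily, which is the content of part (a).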


4.7 Vapnik-Červonenkis Properties of Classes of Functions 


The notion of VC class of sets has several extensions to classes of functions. 


Definitions. Let X be a set and F a class of real-valued functions on X. Let C ⊂ 2^X. If f is any real-valued function, each set {f > t} for t ∈ R will be called a major set of f. The class F will be called a major class for C if all the major sets of each f ∈ F are in C. If C is a Vapnik-Červonenkis class, then F will be called a VC major class (for C).

The subgraph of a real-valued function f will be the set {(x, t) ∈ X × R : 0 ≤ t ≤ f(x) or f(x) ≤ t ≤ 0}. If D is a class of subsets of X × R, and for each f ∈ F, the subgraph of f is in D, then F will be called a subgraph class for D. If D is a VC class in X × R, then F will be called a VC subgraph class.

Recall from Section 3.11 the definitions of symmetric convex hull H(F, M) (times M) and its sequential pointwise closure H_s(F, M). A class F of functions such that F ⊂ H_s(G, M) for some M < ∞ and a given G will be called a VC subgraph hull class if G is a VC subgraph class, and a VC hull class if G = {1_C : C ∈ C} where C is a VC class of sets.


So there are at least four possible ways to extend the notion of VC class to 
classes of functions. Some implications hold between these different conditions, 
but no two of them are equivalent. The next theorem deals with some of the 
easier cases of implication or non-implication. 


Theorem 4.51 Let F be a uniformly bounded class of nonnegative real-valued 
functions on a set X. Then 


(a) If F is the set of indicators of members of a VC class of sets, then F is also 
a VC major class, a VC subgraph class, and a VC hull class. 

(b) If F is a VC major class then it is a VC hull class. 

(c) There exist VC hull classes F which are not VC major. 

(d) There exist VC subgraph classes F which are not VC major. 


Remark. There are VC major classes which are not VC subgraph, as seen in 
Problem 10 and a remark at the end of Section 4.8. 


Proof. (a): The indicators of a VC class C of sets clearly form a VC major class (for C) and a VC hull class. Also, the class of sets A × C for C ∈ C, for a fixed set A, here [0, 1], form a VC class for example by Theorem 4.34. So (a) holds.

(b): Let F be a VC major class for a VC class C of subsets of X. Let |f(x)| ≤ K < ∞ for all f ∈ F and x ∈ X, with K > 0. Then for any f ∈ F, {x : f(x) > −2K} = X, so X ∈ C.

For any f ∈ F, let g := (f + K)/(2K). The class of all such g is clearly VC major for the same C. We have 0 ≤ g ≤ 1. Let


g_n := (1/n) Σ_{j=1}^{n−1} 1_{{g > j/n}} = Σ_{j=0}^{n−1} (j/n) 1_{{j/n < g ≤ (j+1)/n}}.


Then g_n(x) → g(x) as n → ∞ for all x. [In fact, g(x) − 1/n ≤ g_n(x) ≤ g(x) for all x, so g_n → g uniformly.] For each n let f_n := 2K g_n − K 1_X. Then as n → ∞, f_n(x) → f(x) for all x, and f_n ∈ H({1_A : A ∈ C}, 2K). So F is VC hull (for the same VC class C).
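The approximation used in part (b) can be checked numerically. The sketch below evaluates g_n at a single point from the value g(x), using exact rational arithmetic so that comparisons with j/n are not affected by rounding; the function name `g_n` is a hypothetical label for this illustration.

```python
from fractions import Fraction

def g_n(g_value, n):
    """(1/n) * sum over j = 1, ..., n-1 of the indicator of {g > j/n}, at one point."""
    hits = sum(1 for j in range(1, n) if g_value > Fraction(j, n))
    return Fraction(hits, n)

def sandwich_holds(g_value, n):
    """Check g - 1/n <= g_n <= g for a value g in [0, 1]."""
    approx = g_n(g_value, n)
    return g_value - Fraction(1, n) <= approx <= g_value

print(g_n(Fraction(7, 10), 4), sandwich_holds(Fraction(7, 10), 4))
```

Each g_n is an average of indicators of major sets of g, scaled by (n − 1)/n at most, which is why f_n lies in the symmetric convex hull of the indicators of sets in C times 2K.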


(c) Let X = R². Let C be the set of open lower left quadrants {(x, y) : x < a, y < b} for all a, b ∈ R. Then S(C) = 2 by Corollary 4.44. Let F be the set of all sums Σ_{k=1}^{∞} 1_{C(k)}/2^k where C(k) ∈ C. Then clearly F is a VC hull class. The sets where f > 0 for f in F are exactly the countable unions of sets in C. But such unions are not a VC class; for example, they shatter any finite subset of the line x + y = 1. So F is not a VC major class, and (c) is proved.

(d): Let f_n := n^{−1} + n^{−2} 1_{B_n} for n = 1, 2, ..., for any measurable sets B_n. Then f_n ↓ 0, so the subgraphs of the functions f_n are linearly ordered by inclusion and form a VC class of index 1 (Theorem 4.10(a)). Now {f_n > 1/n} = B_n for each n, and the sequence {B_n} need not form a VC class; for example, let B_n be a sequence of independent sets (see Section 4.6). Then {f_n} is not a VC major class, proving (d).


4.8 Classes of Functions and Dual Density 


For a metric space (S, d) and ε > 0 recall D(ε, S, d), the maximum number of points more than ε apart. For a probability measure Q and 1 ≤ p < ∞ we have the L^p metric d_{p,Q}(f, g) := (∫ |f − g|^p dQ)^{1/p}. For a class F ⊂ L^p(Q) let D^{(p)}(ε, F, Q) := D(ε, F, d_{p,Q}). Let D^{(p)}(ε, F) be the supremum of D^{(p)}(ε, F, Q) over all laws Q concentrated in finite sets.

If F is a class of measurable real-valued functions on a measurable space (X, A), let F_F(x) := sup_{f∈F} |f(x)|. Then a measurable function F will be called an envelope function for the class F if and only if F_F ≤ F. If F_F is measurable it will be called the envelope function of the class F. For any law P on (X, A), F_F^* is an envelope function for F, which in general depends on P.

Given a class F, an envelope function F for it, ε > 0 and 1 ≤ p < ∞, let D_F^{(p)}(ε, F, Q) be the supremum of m such that there exist f_1, ..., f_m ∈ F for which ∫ |f_i − f_j|^p dQ > ε^p ∫ F^p dQ for all i ≠ j.
The next fact extends part of Theorem 4.47 to families of functions. The proof is also similar.


Theorem 4.52 Let 1 ≤ p < ∞. Let (X, A, Q) be a probability space and F be a VC subgraph class of measurable real-valued functions on X. Let F have an envelope F ∈ L^p(X, A, Q) with 0 < ∫ F dQ. Let C be the collection of subgraphs in X × R of functions in F. Then for any W > S(C) there is an A < ∞ depending only on W and S(C) such that


D_F^{(p)}(ε, F, Q) ≤ A(2^{p−1}/ε^p)^W for 0 < ε ≤ 1. (4.11)


Proof. We can assume that F > 1 everywhere on X. Given 0 < ε ≤ 1, take a maximal m and f_1, ..., f_m as in the definition of D_F^{(p)}. First, suppose p = 1. For any measurable set B let (FQ)(B) := ∫_B F dQ. Let Q_F := FQ/QF where QF := ∫ F dQ, so Q_F is a probability measure. Let k = k(m, ε) be the smallest integer such that e^{kε/2} > \binom{m}{2}. Then k ≤ 1 + (4 log m)/ε. Let X_1, ..., X_k be i.i.d. (Q_F). Given X_i, let Y_i be uniformly distributed on the interval [−F(X_i), F(X_i)], and such that the vectors (X_i, Y_i) ∈ X × R are independent for i = 1, ..., k. Let C_j be the subgraph of f_j, j = 1, ..., m. Then for all i and j ≠ s,

Pr((X_i, Y_i) ∈ C_j Δ C_s) = ∫ [|f_j(X_i) − f_s(X_i)|/(2F(X_i))] dQ_F(X_i) = ∫ |f_j − f_s| dQ/(2 ∫ F dQ) > ε/2.


Thus by independence


Pr((X_i, Y_i) ∉ C_j Δ C_s for all i = 1, ..., k) ≤ (1 − ε/2)^k ≤ e^{−kε/2} and


Pr{(X_i, Y_i) ∉ C_j Δ C_s for all i = 1, ..., k and some j ≠ s} ≤ \binom{m}{2} e^{−kε/2} < 1.


Thus with positive probability, for all j ≠ s there is some (X_i, Y_i) ∈ C_j Δ C_s. Fix X_i and Y_i, i = 1, ..., k, for which this happens. Then the sets C_j ∩ {(X_i, Y_i)}_{i=1}^{k} are distinct for all j, so m^C(k) ≥ m.

Let S := S(C). By the Sauer and Vapnik-Červonenkis lemmas (Theorem 4.2 and Proposition 4.3), m^C(k) ≤ 1.5k^S/S! for k ≥ S + 2. It follows that for some constant C depending only on S, m^C(k) ≤ C k^S for all k ≥ 1, where C ≤ 2^{S+1} − 1. We can assume that C ≥ 1. So


m ≤ C k^S ≤ C(1 + (4 log m)/ε)^S.


For any α > 0 there is an m_0 such that 1 + 4 log m ≤ m^α for m ≥ m_0 and then m^{1−αS} ≤ C/ε^S. Choosing α small enough so that αS < 1 we have m ≤ (C/ε^S)^{1/(1−αS)} for m ≥ m_0. For any W > S we can solve W = S/(1 − αS) by α = 1/S − 1/W. Then m ≤ Aε^{−W} for A := max(m_0, C^{W/S}), and then m_0 and A are functions of W and S, finishing the proof for p = 1.
Now suppose p > 1. Let Q_{F,p} := F^{p−1}Q/Q(F^{p−1}). Then for i ≠ j,

ε^p ∫ F^p dQ < ∫ |f_i − f_j|^p dQ ≤ ∫ |f_i − f_j| (2F)^{p−1} dQ = ∫ |f_i − f_j| dQ_{2F,p} · Q((2F)^{p−1}),

and Q_{2F,p} = Q_{F,p}. Thus by the p = 1 case,

D_F^{(p)}(ε, F, Q) ≤ D_F^{(1)}(δ, F, Q_{F,p}) ≤ Aδ^{−W}


where δ := ε^p Q(F^p)/[Q(F) Q((2F)^{p−1})]. Now by Hölder’s inequality, QF = Q(F · 1) ≤ (Q(F^p))^{1/p} and Q(F^{p−1}) = Q(F^{p−1} · 1) ≤ (Q(F^p))^{(p−1)/p}. So δ ≥ ε^p/2^{p−1} and the conclusion follows.

The following fact is a continuation of Theorem 4.51. 
Theorem 4.53 Let F be a uniformly bounded class of functions on a set X.
(a) If F is a VC subgraph class, then

for some r < ∞ and M < ∞, D^{(2)}(ε, F) ≤ M ε^{−r} for 0 < ε ≤ 1. (4.12)


(b) There exist classes F satisfying (4.12) which are not VC hull.
(c) There exist VC subgraph classes which are not VC hull.


Proof. (a) There is a finite constant envelope function K for F, so for any 
fig < F andlaw y, S(f — g}dy < 2K f |f — g|dy. Thus (a) follows from 
Theorem 4.52 with r = 2W. 


(b) It will be shown that there exist sequences F = {bn 14, }n>1, where we can 
take b, = 1/n” for any positive integer v, such that F is not VC hull. Clearly 
such a sequence will satisfy (4.12). For this two lemmas will be helpful: 


Lemma 4.54 Let $A_1, \dots, A_n$ be jointly independent events in a probability
space $(X, P)$ with $P(A_i) = 1/2$, $i = 1, \dots, n$. Let $B_1, \dots, B_n$ be any events.
Let $D := \bigcup_{j=1}^n A_j \,\Delta\, B_j$. Then the algebra $\mathcal{B}$ generated by $B_1, \dots, B_n$ has at
least $2^n(1 - P(D))$ atoms.


Proof. For any $F \subset \{1, \dots, n\}$ let

$$A_F := \bigcap_{j \in F} A_j \cap \bigcap_{j \notin F,\ j \le n} (X \setminus A_j)$$

and define $B_F$ likewise. The atoms of $\mathcal{B}$ are those $B_F$ which are nonempty. For
each $F$, $P(A_F) = 1/2^n$. If a point of $A_F$ is not in $D$, then it is also in $B_F$, which
then is nonempty. Since $D$ can include at most $2^n P(D)$ of the $2^n$ events $A_F$,
the Lemma follows.
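
Lemma 4.54 can be illustrated on a finite probability space. The following sketch makes several assumptions that are not in the text: the space is $\{0,1\}^{n+e}$ with the uniform law, $A_j$ is the event that coordinate $j$ equals 1, and each $B_j$ is obtained from $A_j$ by toggling a few randomly chosen points. It counts the atoms of the algebra generated by the $B_j$ as distinct membership patterns and returns the lower bound $2^n(1 - P(D))$ for comparison.

```python
import itertools
import random
from fractions import Fraction


def atoms_versus_bound(n, extra_bits, flips, seed):
    """Return (number of atoms generated by B_1..B_n, 2^n * (1 - P(D)))."""
    rng = random.Random(seed)
    points = list(itertools.product((0, 1), repeat=n + extra_bits))
    # A_j = {x : x_j = 1}: independent events of probability 1/2 under the uniform law.
    A = [frozenset(p for p in points if p[j] == 1) for j in range(n)]
    # B_j differs from A_j on `flips` randomly chosen points.
    B = [A[j] ^ frozenset(rng.sample(points, flips)) for j in range(n)]
    D = frozenset().union(*(A[j] ^ B[j] for j in range(n)))
    prob_D = Fraction(len(D), len(points))
    # Each atom is a nonempty set B_F, i.e. one realized membership pattern.
    atoms = {tuple(p in B[j] for j in range(n)) for p in points}
    return len(atoms), 2 ** n * (1 - prob_D)
```

With `flips = 0` the two returned numbers coincide at $2^n$; with perturbed sets the atom count should never fall below the bound.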


Lemma 4.55 Suppose $A_j$ are independent events with $P(A_j) = 1/2$ for all $j$,
and $\mathcal{C}$ is a class of events such that for some $K < \infty$ and $u < \infty$, for each
$j$ there is an event $D_j$ such that $P(A_j \,\Delta\, D_j) < \eta_j$, where $D_j$ is in an algebra
generated by at most $Kj^u$ elements of $\mathcal{C}$ and $\sum_j \eta_j < 1$. Then $\mathcal{C}$ is not a VC
class.


Proof. Let $\alpha := 1 - \sum_j \eta_j > 0$. By Lemma 4.54, for each $m = 1, 2, \dots$,
the algebra $\mathcal{D}_m$ generated by $D_1, \dots, D_m$ has at least $2^m\alpha$ atoms. On the
other hand, $\mathcal{D}_m$ is generated by at most $\sum_{j=1}^m Kj^u \le K\int_1^{m+1} x^u\, dx \le K(m + 1)^{u+1}/(u + 1)$
sets in $\mathcal{C}$. By Theorems 4.49 and 4.50(c), C% is a VC class and
then by Theorem 4.50(b) and Corollary 4.4, there are $t < \infty$ and $C < \infty$ such
that the number of atoms of the algebra generated by $k$ elements of $\mathcal{C}$ is at
most $Ck^t$. Then $2^m\alpha$ is bounded above by a polynomial in $m$ of degree at most
$(u + 1)t$, a contradiction, so Lemma 4.55 is proved.


Now to prove Theorem 4.53(b), let $X := [0, 1]$, let $P := U[0, 1]$ be the
uniform (Lebesgue) law and let $A_n$ be independent sets with $P(A_n) = 1/2$.
Let $v$ be a positive integer and $\mathcal{F} := \{1_{A_n}/n^v\}_{n \ge 1}$. Then (4.12) holds for $\mathcal{F}$.
Suppose $\mathcal{F}$ is a VC hull class for some $\mathcal{C}$.

Suppose $P(A) = 1/2$ and that $a + b1_A \in \overline{H}_s(\mathcal{C}, M)$ for some $a \ge 0$, $b > 0$
and $0 < M < \infty$. We can take $M = 1$, replacing $a$ by $a/M$ and $b$ by $b/M$.
Let $r := S(\mathcal{C}) < \infty$. Let $d_P(C, D) := P(C \,\Delta\, D)$ for measurable sets $C$, $D$.
Then for any $w > r$, specifically for $w := r + 1$, there is some $C < \infty$ with
$D(\varepsilon, \mathcal{C}, d_P) \le C\varepsilon^{-w}$ for $0 < \varepsilon \le 1$ by Corollary 4.4 and Theorem 4.47. Given
$0 < \beta < 1$, take $C_j \in \mathcal{C}$ and $t_j$ with $\sum_j |t_j| \le 1$ such that
$P(|a + b1_A - \sum_j t_j 1_{C_j}|) < \beta$. Choose $D_i \in \mathcal{C}$, $i = 1, \dots, m$, where $m \le C\beta^{-r-1}$, such that
for each $j$ there is an $i := i(j)$ with $P(C_j \,\Delta\, D_i) < \beta$. Let $f := \sum_j t_j 1_{D_{i(j)}}$. Then
$P(|a + b1_A - f|) < 2\beta$. Let $B := \{f > a + b/2\}$. Then $B$ is in the algebra
generated by $D_1, \dots, D_m$, and $P(A \,\Delta\, B) \le 4\beta/b$.

Apply what has just been shown to $A = A_k$ for $k = 1, 2, \dots$, with $a = a_k = 0$,
$b = b_k = 1/k^v$, and $\beta = \beta_k = 1/(8k^{2+v})$. Then there are sets $B = B_k$
and $m = m_k \le Nk^u$ for some $N < \infty$ where $u := (r + 1)(2 + v) < \infty$. To
apply Lemma 4.55, let $\eta_k = 1/(2k^2)$ to get a contradiction. So (b) is proved.
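
The choice of constants in the last step can be verified with exact arithmetic. The sketch below (function names are illustrative only) checks that $4\beta_k/b_k = \eta_k = 1/(2k^2)$ for $b_k = 1/k^v$ and $\beta_k = 1/(8k^{2+v})$, and that partial sums of the $\eta_k$ stay below 1, as Lemma 4.55 requires. The full series equals $\pi^2/12 \approx 0.822$.

```python
from fractions import Fraction


def parameters(k: int, v: int):
    """Return (b_k, beta_k, eta_k) as exact fractions."""
    b = Fraction(1, k ** v)
    beta = Fraction(1, 8 * k ** (2 + v))
    eta = Fraction(1, 2 * k ** 2)
    return b, beta, eta


def bound_matches(k: int, v: int) -> bool:
    """Check that the bound 4 * beta_k / b_k on P(A_k symmetric-difference B_k) equals eta_k."""
    b, beta, eta = parameters(k, v)
    return 4 * beta / b == eta


def partial_sum_eta(n: int) -> Fraction:
    """Sum of eta_k = 1 / (2 k^2) for k = 1, ..., n."""
    return sum(Fraction(1, 2 * k * k) for k in range(1, n + 1))
```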

For (c), let $a = a_k = 1/k$ and $b = b_k = 1/k^2$ to get a VC subgraph class
as in the proof of Theorem 4.51(d). The same argument as for (b), now with
$v = 2$, again gives a contradiction, so (c) holds.


It will be shown in Proposition 10.12 below that there are VC major (thus 
VC hull) classes which do not satisfy (4.12) (and so, in particular, are not VC 
subgraph classes). 


Problems 


1. Let C be the class of all unions of two intervals in R. Evaluate S(C). Hint: 
Try it first directly; if you like, look at the more general Problem 11. 
2. If $S(\mathcal{C}) = 3$ find the upper bounds for $m^{\mathcal{C}}(n)$ given by Theorem 4.2 and by
Proposition 4.3.

3. Show that for dens(C) = 0, S(C), which is finite by Corollary 4.4, can be 
arbitrarily large. Hint: Let C be finite. 


4. Find the smallest $n$ such that there is a set $X$ with $|X| = n$ and $\mathcal{C} \subset 2^X$ with
$S(\mathcal{C}) = 1$ where neither (a) nor (b) in Theorem 4.10 holds.

Hints on Problems 5-7: If $\mathcal{C}$ is a collection of convex sets in $\mathbb{R}^d$ and shatters
a set $F$, then no point in $F$ is in the convex hull of the other points. Then, the
convex hull of $F$ is a polyhedron of which each point of $F$ is a vertex. In the
plane, it is a polygon. To get a lower bound $S(\mathcal{C}) \ge k$ it is enough to find one
set of $k$ elements that is shattered. Try the vertices of a regular $k$-gon. To get
upper bounds, use facts such as Theorem 4.6 and Proposition 4.36.


5. Let $\mathcal{C}$ be the set of all interiors of ellipses in $\mathbb{R}^2$, with arbitrary centers and
semiaxes in any two perpendicular directions. Give upper and lower bounds
for $S(\mathcal{C})$.


6. A half-plane in $\mathbb{R}^2$ is a set of the form $\{(x, y) : ax + by > c\}$ for real $a$, $b$, $c$
with $a$ and $b$ not both 0. Define a wedge as an intersection of two half-planes.
Let $\mathcal{C}$ be the collection of all wedges in $\mathbb{R}^2$. Show that $S(\mathcal{C}) \ge 5$. Also find an
upper bound for $S(\mathcal{C})$.


7. Let $\mathcal{C}$ be the set of all interiors of triangles in $\mathbb{R}^2$. Show that $S(\mathcal{C}) \ge 7$. Also
give an upper bound for $S(\mathcal{C})$.


8. Show that the lower bounds for $S(\mathcal{C})$ in Problems 6 and 7 are the values of
$S(\mathcal{C})$. Hint: For a convex polygon, the set $F$ of vertices can be arranged in cyclic
order, say clockwise around the boundary of the polygon, $v_1, v_2, \dots, v_n, v_1$.
Show that if a half-plane $J$ contains $v_i$ and $v_j$ with $i < j$, then it includes either
$\{v_i, v_{i+1}, \dots, v_j\}$ or $\{v_j, v_{j+1}, \dots, v_n, v_1, \dots, v_i\}$. Thus find what kind of set
the intersection of $J$ and $F$ must be. From that, find what occurs if two or three
half-planes are intersected (or unioned, via complements).


9. In the example at the end of Section 4.3, for each set $A \subset X$ with 3 elements,
find a specific subset of $A$ not in $A \sqcap \mathcal{C}$.


10. Let $\mathcal{F}$ be the class of all probability distribution functions on $\mathbb{R}$. Show
that $\mathcal{F}$ is a VC major class but not a VC subgraph class. Hint: Show that the
subgraphs of functions in $\mathcal{F}$ shatter all sets $\{(x_j, y_j)\}_{j=1}^n$ with $x_1 < \cdots < x_n$
and $0 < y_1 < \cdots < y_n < 1$.


11. Let $\mathcal{C}(j)$ be the class of all unions of $j$ intervals in $\mathbb{R}$ for $j = 1, 2, \dots$. Show
that $S(\mathcal{C}(j)) = 2j$ for all $j$ and that for any finite set $F \subset \mathbb{R}$ with $|F| = n$ we
have $\Delta^{\mathcal{C}(j)}(F) = {}_nC_{\le 2j}$ (the largest possible value by Sauer's Lemma). Hints:
One can take $F = \{1, 2, \dots, n\}$. For $A \subset F$ and $x, y \in F$ let $x \equiv_A y$ mean
that $A \cap \{x, y\} = \emptyset$ or $\{x, y\}$, otherwise $x \not\equiv_A y$. If $A \neq \emptyset$ let $j_1 := j_1(A)$
be the least element of $A$. If $j_1(A), \dots, j_q(A)$ are defined, let $j_{q+1}(A)$ be
the least $j > j_q(A)$ such that $j \not\equiv_A j_q(A)$ and $j \le n$, if there is such a $j$.
Show that there is a 1-1 correspondence between subsets $A \subset F$ and finite
sequences $(j_1, j_2, \dots, j_r)$ for $r = r(A) = 1, \dots, n$ where $r(\emptyset) := 0$. Show
that $A \in \mathcal{C}(j) \sqcap F$ if and only if $r(A) \le 2j$.
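
The counting claim in Problem 11 can be checked by brute force for small $n$, which may help in guessing the structure of a proof. The sketch below assumes that a "union of $j$ intervals" may include empty intervals, so that a subset of $\{1, \dots, n\}$ is a trace exactly when it consists of at most $j$ blocks of consecutive integers. The function names are illustrative.

```python
from itertools import combinations
from math import comb


def runs(subset) -> int:
    """Number of maximal blocks of consecutive integers in a set of integers."""
    return sum(1 for x in subset if x - 1 not in subset)


def trace_count(n: int, j: int) -> int:
    """Number of subsets of {1, ..., n} cut out by unions of at most j intervals."""
    count = 0
    for size in range(n + 1):
        for combo in combinations(range(1, n + 1), size):
            if runs(set(combo)) <= j:
                count += 1
    return count


def sauer_bound(n: int, k: int) -> int:
    """The sum of binomial coefficients C(n, i) for i = 0, ..., k."""
    return sum(comb(n, i) for i in range(k + 1))
```

For example, `trace_count(4, 1)` and `sauer_bound(4, 2)` both count the empty set together with the ten integer intervals inside $\{1, 2, 3, 4\}$.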


12. For each $d = 1, 2, \dots$, let $\mathcal{P}_d$ be the vector space of all polynomials of
degree at most $d$ on $\mathbb{R}$.


(a) Show that $\mathrm{pos}(\mathcal{P}_d)$ shatters all sets $F$ of $d + 1$ points in $\mathbb{R}$. (It shatters no set
of $d + 2$ points, by what fact?) Hints: If for a given finite set $F$, a polynomial
$f > 0$ on a set $A \subset F$ and $f \le 0$ on $F \setminus A$, then for some $\varepsilon > 0$, $f - \varepsilon$ is
still $> 0$ on $A$, is a polynomial of the same degree, and is $< 0$ on $F \setminus A$.
Let $F = \{x_1, \dots, x_{d+1}\}$ where $x_1 < x_2 < \cdots < x_{d+1}$. If for some consecutive
points $x_j < x_{j+1}$ with $j = 1, \dots, d$, one of $x_j$ and $x_{j+1}$ is in $A$ but not the
other, the polynomial $f$ should have a simple zero somewhere in the interval
$(x_j, x_{j+1})$. Find a polynomial $f$ with just such zeroes, show that its degree is
at most $d$, and take $-f$ if necessary, then show that $f(x) > 0$ for $x \in F$ if and
only if $x \in A$.


(b) Prove directly, without using an earlier fact, that positivity sets of functions
in $\mathcal{P}_d$ cannot shatter a set $F$ of $d + 2$ points. Hint: Let $A$ be the set of all $x_j \in F$
with $j$ odd, and show that the argument in part (a) can be reversed.
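
The construction suggested in the hint to part (a) can be tried out mechanically. This is a sketch only: placing the zeros at midpoints is one convenient choice among many, and the helper names are hypothetical. It builds, for each labelling of the points, a polynomial with one simple zero between every pair of consecutive points whose labels differ, and checks its sign at each point.

```python
from fractions import Fraction
from itertools import product


def sign_polynomial(points, labels):
    """Return (f, degree) with f > 0 at points[i] exactly when labels[i] is True.

    points must be strictly increasing; zeros are put at the midpoints of
    consecutive points whose labels differ.
    """
    zeros = [
        (points[i] + points[i + 1]) / 2
        for i in range(len(points) - 1)
        if labels[i] != labels[i + 1]
    ]

    def raw(x):
        value = Fraction(1)
        for z in zeros:
            value *= x - z
        return value

    # Replace the polynomial by its negative if the sign at the first point is wrong.
    sign = 1 if (raw(points[0]) > 0) == labels[0] else -1
    return (lambda x: sign * raw(x)), len(zeros)


def shatters(points) -> bool:
    """Check every labelling of the d + 1 points with a polynomial of degree <= d."""
    for labels in product((False, True), repeat=len(points)):
        f, degree = sign_polynomial(points, labels)
        if degree > len(points) - 1:
            return False
        if any((f(x) > 0) != lab for x, lab in zip(points, labels)):
            return False
    return True
```

Exact rational arithmetic via `Fraction` avoids any sign ambiguity from rounding.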


A set of $n$ distinct points of $\mathbb{R}^d$ is said to be in general position if no $k + 2$
of them are in any $k$-dimensional hyperplane for $k = 1, \dots, d - 1$.


13. (a) Recall the definition of half-planes in the plane from Problem 6. Show
that the collection $H(2)$ of all half-planes in the plane shatters every set of three
points in general position, but not every set of three points.

(b) An open half-space in $\mathbb{R}^d$ will be a set of the form
$\{x := \{x_j\}_{j=1}^d : c_0 + \sum_{j=1}^d c_j x_j > 0\}$ where not all $c_j$ for $1 \le j \le d$ are 0. In $\mathbb{R}^3$, we know (why?)
that the set $H(3)$ of all half-spaces shatters some set of 4 points. Show that
every set of 4 points in general position is shattered.

(c) Give an example of a set of 4 points in $\mathbb{R}^3$, no three of which are in any line,
but that is not shattered.


14. It is a known fact, although not proved in this book, that for the class $H(d)$ of
all half-spaces in $\mathbb{R}^d$, $m^{H(d)}(n) = P_d(n) := 2\,{}_{n-1}C_{\le d}$. Let $Q_d(n) := {}_nC_{\le d+1}$,
which is the largest possible value of $m^{\mathcal{C}}(n)$ for VC classes $\mathcal{C}$ with $S(\mathcal{C}) = d + 1$.
In this case $S(H(d)) = d + 1$.


(a) Evaluate the two polynomials $P_2(n)$ and $Q_2(n)$ explicitly. Show that they
both equal $2^n$ for $n = 1$, 2, and 3 but not for $n = 4$.




(b) Find an example of a set $F$ of four points in $\mathbb{R}^2$ such that the number
$\Delta^{H(2)}(F)$ of sets $A \cap F$ for $A \in H(2)$ achieves its maximum $P_2(4)$. Hint: Take
the vertices of a square.


(c) If the four points in F are the vertices of a triangle and a point in the interior 
of the triangle, then what is the maximum? 
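
The two counting functions in Problem 14 are easy to tabulate. The sketch below takes ${}_nC_{\le k} := \sum_{i=0}^{k}\binom{n}{i}$, $P_d(n) := 2\,{}_{n-1}C_{\le d}$, and, as an assumption consistent with Sauer's Lemma for $S(\mathcal{C}) = d + 1$, $Q_d(n) := {}_nC_{\le d+1}$. The function names are illustrative.

```python
from math import comb


def c_le(n: int, k: int) -> int:
    """Sum of binomial coefficients C(n, i) for i = 0, ..., k."""
    return sum(comb(n, i) for i in range(k + 1))


def P(d: int, n: int) -> int:
    """Stated number of subsets of n points in general position cut out by half-spaces in R^d."""
    return 2 * c_le(n - 1, d)


def Q(d: int, n: int) -> int:
    """Sauer bound for a class with S(C) = d + 1 (an assumption about the definition of Q_d)."""
    return c_le(n, d + 1)
```

Tabulating `P(2, n)` and `Q(2, n)` for $n = 1, \dots, 4$ gives a quick check on the closed forms asked for in part (a).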


Notes 


Notes to Section 4.1. The definitions of $\Delta^{\mathcal{C}}$, $m^{\mathcal{C}}$ and (in effect) $V(\mathcal{C})$ appeared
in the announcement by Vapnik and Červonenkis (1968). In their 1971 paper
they had a weaker form of Theorem 4.2 with “$\ge {}_nC_{\le k}$” instead of “$> {}_nC_{\le k-1}$.”
The theorem as stated appears in Sauer (1972, Theorem 1). The quantities ${}_nC_{\le k}$
first appeared in mathematics, to my knowledge, for general $n \ge k$, in work
of Schläfli (1901), who showed that this is the maximum number of open
regions into which $\mathbb{R}^k$ is decomposed by $n$ hyperplanes $\{x : (v_j, x) = c_j\}$
where $v_j \neq 0$ in $\mathbb{R}^k$, $c_j \in \mathbb{R}$, and $(v, x) := v \cdot x$ is the usual inner product.
Schläfli showed that the maximum is attained when the hyperplanes are in
general position, meaning in this case that any $k$ or fewer of the $v_j$ are linearly
independent. Steiner (1826) had proved these facts for $k \le 3$. Cover (1965),
Harding (1967), and Watson (1969) considered $m^{H(k)}(n)$ and showed that it is
$\le 2\,{}_{n-1}C_{\le k}$. Schläfli (1901, p. 211), for the case $c_j = 0$ of $(k - 1)$-dimensional
linear subspaces, and the class $H_0(k)$ of half-spaces bounded by them, had
shown that $m^{H_0(k)}(n) \le 2\,{}_{n-1}C_{\le k-1}$, also attained when the subspaces are in
general position in the same sense.

Steele (1978b, and earlier in his 1975 Ph.D. thesis) seems to have coined 
the term “shatter.” 

“Pascal’s triangle,” published by Pascal in 1653, had been known since much 
earlier times. Some sources given are the Persian mathematicians Al-Karaji 
(953-1029) and the later but more famous Omar Khayyam (1048-1131); the 
Chinese mathematicians Yang Hui (whose name is said to be used in China) 
and Jia Xian (1010-1070); and in India, Halayudha (about 975) who referred 
to a book, apparently now lost, by an author some 1,000 years earlier. 

Extended forms of Sauer’s lemma for possibly infinite cardinals are said to 
have been found independently by Shelah (1972). Do Shelah’s results imply 
Sauer’s lemma? Steele (1978a) cites Sauer’s paper and not Shelah’s (in what he 
later said, informally on the Web, was a conscious decision after looking at both 
papers). I also do not use the name “Sauer—Shelah” lemma, having also looked 
at both papers, and also not claiming fully to understand the mathematical logic 
content of Shelah’s paper. Later, some authors have added the name Perles to 
Shelah’s, but I could not find a relevant publication in their reference lists or 
elsewhere. 
Vapnik and Červonenkis in their 1974 book also gave Theorem 4.2, then
proved Proposition 4.3. Assouad (1981, (3.2); 1983, Proposition 2.4) defined
“dens” and noted Proposition 4.5.


Notes to Section 4.2. Theorem 4.6, when H is the space of linear functions on 
R” and f = 0, is a classical fact known as Radon’s Theorem, proved by Radon 
(1921, p. 114) and reviewed by Danzer, Grünbaum, and Klee (1963, p. 103).
Theorem 4.6 for general H and f = 0 appeared in Dudley (1978, Theorem 
7.2); Wenocur and Dudley (1981) proved it for any f. Assouad (1983) noted 
Theorem 4.7. 


Notes to Section 4.3. This section is based on the paper Dudley (1985b). 


Notes to Section 4.4. This section is also largely based on Dudley (1985b), 
except that Theorem 4.30 was added, just after the first edition of this book was 
published (1999), in Dudley (2000). Theorem 4.21 and other characterizations 
of trees are given in Harary (1969), pp. 32-33. 


Notes to Section 4.5. Some of the facts in this section are from Dudley (1984, 
Section 9.2). Those on the density, Theorems 4.31 and 4.33, are due to Assouad 
(1983). Assouad also told me Proposition 4.36 in a letter. Theorem 4.41 and 
Corollary 4.44 are from Wenocur and Dudley (1981). Theorems 4.30 and 4.38, 
and the case k = 3 of Proposition 4.39, appeared in Dudley (2000). Theorem 
4.46 appeared in Smoktunowicz (1997). 


Notes to Section 4.6. Theorem 4.47 is partly in Dudley (1978, Section 7) and
was stated in the current form by Assouad (1981, 1983). Haussler (1995) gave a
sharper form. Assouad (1983) also proved Theorem 4.49, first defined dens*(·),
and proved Theorem 4.50.


Notes to Section 4.7. VC subgraph classes have been called “VC graph” 
classes (Alexander 1984, 1987) or “polynomial classes” (Pollard 1984, pp. 17, 
34; 1985). This section is based on a small part of Dudley (1987). 


Notes to Section 4.8. Theorem 4.52 is essentially Lemma 25 of Pollard (1984, 
p. 27) and had appeared earlier in Pollard (1982). Theorem 4.53(b) and (c) are 
based on Section 3 of Dudley (1987). 
Measurability


The example after Theorem 3.2 showed that for a continuous distribution
function $F$ such as for $U[0, 1]$, the set of all possible functions $\sqrt{n}(F_n - F)$,
even for $n = 1$, is nonseparable in the sup norm, and all its subsets are closed,
including those corresponding to nonmeasurable sets of possible values of the
observation $X_1$. Therefore, the classical definition of convergence in law, or
weak convergence, which works in separable metric spaces, does not work in
this case. So, in Chapter 3, functions $f^*$ and upper expectations $E^*$ were used
to get around measurability problems.

But, in the classical Glivenko–Cantelli theorem, saying that
$\sup_x |(F_n - F)(x)| \to 0$ almost surely as $n \to \infty$ for any distribution function $F$ on $\mathbb{R}$
and its empirical distribution functions $F_n$ (RAP, Theorem 11.4.2), there is no
measurability problem. The supremum is measurable, as it can be restricted
to rational $x$ by right-continuity of $F_n$ and $F$. The collection $\mathcal{C}$ of left half-lines
$(-\infty, x]$ is linearly ordered by inclusion and so has $S(\mathcal{C}) = 1$, and for
it, not only the Glivenko–Cantelli theorem but, after suitable formulations
(Theorem 1.8 or, less specifically, Chapter 3), the uniform central limit theorem
(Donsker property) holds for any probability measure $P$ on the Borel sets
of $\mathbb{R}$.

Another example of a class $\mathcal{C}$ linearly ordered by inclusion will be given,
where the Glivenko–Cantelli property actually fails, although
$\sup_{A \in \mathcal{C}} |(P_n - P)(A)|$ is measurable (it is identically 1). The notion of linear ordering was
defined before Theorem 4.17. A linearly ordered set $(X, \le)$ is said to be well-ordered
iff every nonempty subset contains a least element. Thus, for example,
with usual orderings, the set $\mathbb{N}$ of nonnegative integers is well-ordered, but the
set $\mathbb{Z}$ of all integers is not. It follows from the axiom of choice that every set can
be well-ordered (Zermelo's theorem, RAP, Theorem 1.5.1), and in particular,
there exist well-ordered sets $(X, \le)$ such that $X$ is uncountable.




Example. This is related to the “ordinal triangle” counterexample in integration
theory, showing why measurability is needed in the Tonelli–Fubini Theorem
on Cartesian product integrals. Let $(\Omega, \le)$ be an uncountable well-ordered set
such that for each $x \in \Omega$, the initial segment $I_x := \{y : y \le x\}$ is countable.
(In terms of ordinals, $\Omega$ is, or is order-isomorphic to, the least uncountable
ordinal.) Let $\mathcal{S}$ be the σ-algebra of subsets of $\Omega$ consisting of sets that are
countable or have countable complement. Let $P$ be the probability measure on
$\mathcal{S}$ which is 0 on countable sets and 1 on sets with countable complement. Then

$$\int\!\!\int 1_{y \le x}\, dP(y)\, dP(x) = 0 < \int\!\!\int 1_{y \le x}\, dP(x)\, dP(y) = 1.$$

Since all other hypotheses of the Tonelli–Fubini theorem hold, the function
$(x, y) \mapsto 1_{y \le x}$ must not be measurable for the product σ-algebra, even if $\mathcal{S}$ is
replaced by any larger σ-algebra of subsets of $\Omega$ to which $P$ can be extended.
For example, according to the continuum hypothesis, we could take $\Omega$ to be
$[0, 1]$ (where the well-ordering is unrelated to the usual ordering), and $P$ to be
Lebesgue measure or any other nonatomic law on $[0, 1]$.

Now, consider the class $\mathcal{C}$ of sets $I_x$ for each $x \in \Omega$. Each of these sets is
countable by assumption. The sets are linearly ordered by inclusion since $\le$
is a linear ordering. Thus $S(\mathcal{C}) = 1$ by Theorem 4.10. But, $\mathcal{C}$ is not a weak or
strong Glivenko–Cantelli class as defined in Section 3.3 (still less a Donsker
class), since for any possible $X_1, \dots, X_n$, a maximum $x := \max(X_1, \dots, X_n)$
for the well-ordering exists, so $P_n(I_x) = 1$ while $P(I_x) = 0$, so
$\sup_{A \in \mathcal{C}} (P_n - P)(A) = 1$.

To find hypotheses that will avoid problems as in the example just given,
suppose we have a class $\mathcal{F}$ of functions on $X$, where $(X, \mathcal{A})$ is a measurable
space, for example $\mathcal{F} = \{1_A : A \in \mathcal{C}\}$ for a class $\mathcal{C} \subset \mathcal{A}$. The mapping
$(f, x) \mapsto f(x)$ seems very natural, but is it jointly measurable in a useful sense?
Sometimes it is useful to consider a kind of parametrization, via a measurable
space $(Y, \mathcal{B})$, with a mapping $y \mapsto f_y$ from $Y$ onto $\mathcal{F}$. We will want the mapping
$(x, y) \mapsto f_y(x)$ to be jointly measurable on $X \times Y$. This property, called
(image) admissibility, will be the topic of Section 5.2. In the above examples,
for the usual ordering $\le$ of $\mathbb{R}$, the set $\{(x, y) : x \le y\}$ is jointly measurable
(Borel, in fact closed in $\mathbb{R}^2$, and any open set in $\mathbb{R}^2$ is a countable union of
open rectangles), but we saw that in an uncountable well-ordered set $(X, \le)$,
$\{(x, y) : x \le y\}$ was not jointly measurable in $X \times X$. A convenient condition
to assume on the parametrizing set $Y$ is that it is a Suslin measurable space, a
property to be defined and treated, in combination with admissibility, in Section 5.3.
The admissibility and Suslin properties will hold for classes of sets
(or functions) encountered in practice. It will be shown in Chapter 6 that these
conditions, together with the Vapnik–Červonenkis or related properties, are
enough to imply Glivenko–Cantelli and Donsker properties.
5.1 Sufficiency


This section could nearly be starred, as it is referred to later only once in this 
chapter and once in Chapter 6. 

Suppose that a probability measure $P$ is known to be in a certain family $\mathcal{P}$
of laws and we have observed $X_1, \dots, X_n$ i.i.d. ($P$), but nothing else is known
about $P$. A statistic, $T$, which is a measurable function of $X_1, \dots, X_n$, will
roughly speaking be said to be sufficient for $\mathcal{P}$ if, given $T$, no further information
about $X_1, \dots, X_n$ is useful in making decisions or inferences about $P \in \mathcal{P}$. A
precise definition is given below. This section will show that the empirical
measure $P_n$ is sufficient even when $\mathcal{P}$ is the family of all probability measures
on a measurable space.

Note that $P_n$ is a symmetric function of the $X_i$ in the sense that it is preserved
by any permutation of the indices $1, \dots, n$. Once $P_n$ is given, knowing that the
$X_i$ were observed in a certain order will not help in making inferences about
$P$.

Here is the formal definition of sufficiency: let $(S, \mathcal{B})$ be a measurable space
(a set $S$ and a σ-algebra $\mathcal{B}$ of subsets of $S$). Let $\mathcal{Q}$ be a set of probability
laws on $(S, \mathcal{B})$. A sub-σ-algebra $\mathcal{D}$ of $\mathcal{B}$ is called sufficient for $\mathcal{Q}$ iff for every
$C \in \mathcal{B}$ there is some $\mathcal{D}$-measurable function $g_C$ such that for every $Q \in \mathcal{Q}$, the
conditional probability


$Q(C \mid \mathcal{D}) = g_C$ almost surely for $Q$. (5.1)


The essential point is that $g_C$ does not depend on $Q$ in $\mathcal{Q}$.

Most often, there will be some $n \ge 1$, a measurable space $(X, \mathcal{A})$ and a
family $\mathcal{P}$ of laws on $(X, \mathcal{A})$ such that $S$ is the $n$-fold Cartesian product $X^n$ with
the product σ-algebra $\mathcal{B} = \mathcal{A}^n$ and $\mathcal{Q} = \mathcal{P}^n := \{P^n : P \in \mathcal{P}\}$, where $P^n$ is
the $n$-fold Cartesian product $P \times P \times \cdots \times P$ (RAP, Theorem 4.4.6).

The meaning of sufficiency is clarified by the factorization theorem, to be
proved next. A family $\mathcal{P}$ of probability measures on a measurable space $(S, \mathcal{B})$
is said to be dominated by a measure $\mu$ if every $P \in \mathcal{P}$ is absolutely continuous
with respect to $\mu$. Then we have the density (Radon–Nikodym derivative)
$dP/d\mu$ (RAP, Section 5.5).

If there is a nonatomic law on $(S, \mathcal{B})$, the family $\mathcal{P}$ of all laws on $(S, \mathcal{B})$ is
not dominated. Factorization is still useful in that case, in the proof of Theorem
5.6 below.


Theorem 5.1 (Factorization theorem) Let $(S, \mathcal{B})$ be a measurable space, $\mathcal{D}$
a sub-σ-algebra of $\mathcal{B}$, and $\mathcal{P}$ a family of probability measures on $\mathcal{B}$, dominated
by a σ-finite measure $\mu$. Then $\mathcal{D}$ is sufficient for $\mathcal{P}$ if and only if there is a
$\mathcal{B}$-measurable function $h \ge 0$ such that for all $P \in \mathcal{P}$, there is a $\mathcal{D}$-measurable
function $f_P$ with $dP/d\mu = f_P h$ almost everywhere for $\mu$. We can take
$h \in \mathcal{L}^1(S, \mathcal{B}, \mu)$.


Proof. Two measures are called equivalent if each is absolutely continuous
with respect to the other. Two families $\mathcal{P}$ and $\mathcal{Q}$ of probability measures on a
σ-algebra $\mathcal{B}$ are called equivalent if and only if for all $B \in \mathcal{B}$, ($P(B) = 0$ for
all $P \in \mathcal{P}$) is equivalent to ($Q(B) = 0$ for all $Q \in \mathcal{Q}$).


Lemma 5.2 If a family $\mathcal{P}$ of probability measures on a measurable space
$(S, \mathcal{B})$ is dominated by a σ-finite measure $\mu$, there is a countable subfamily of
$\mathcal{P}$ equivalent to $\mathcal{P}$.


Proof. For each $P \in \mathcal{P}$, let $f_P$ be (a specific version of) the Radon–Nikodym
derivative (density) $dP/d\mu$ and let $K_P := \{x : f_P(x) > 0\}$. A set $K \in \mathcal{B}$
such that for some $P \in \mathcal{P}$, $K \subset K_P$ and $\mu(K) > 0$, will be called a kernel. A
chain is any union of disjoint kernels (necessarily a countable union).

If $\mu$ is not finite let $S = \bigcup_{n \ge 1} A_n$ where $A_n$ are disjoint measurable sets
with $0 < \mu(A_n) < \infty$. Let $\nu(A) := \sum_{n \ge 1} \mu(A \cap A_n)/[2^n \mu(A_n)]$. Then $\nu$ and
$\mu$ are equivalent (mutually absolutely continuous), so replacing $\mu$ by $\nu$, we can
assume $\mu$ is finite.

Let $C_1, C_2, \dots$ be chains such that $\sup_n \mu(C_n) = \sup\{\mu(C) : C \text{ a chain}\}$.
Let $D_1 := C_1$. Given a chain $D_{n-1}$, let $C_n = \bigcup_{j \ge 1} K_{nj}$ where
$K_{nj}$ are disjoint kernels (for $n$ fixed). Let $D_{nj} := K_{nj} \setminus D_{n-1}$ if it is
a kernel, i.e., if $\mu(K_{nj} \setminus D_{n-1}) > 0$. Otherwise, let $D_{nj}$ be empty. Let
$D_n := D_{n-1} \cup \bigcup_j D_{nj}$, which is a chain. Let $D := \bigcup_n D_n$, a chain
with maximal $\mu$. Then $D = \bigcup_n K_n$ for some disjoint kernels $K_n$. Choose
$P := P_n := P(n)$ in $\mathcal{P}$ such that $K_n \subset K_{P(n)}$ and $P_n(K_n) > 0$. To show
that $\{P_n\}$ is equivalent to $\mathcal{P}$, suppose not. Then for some $B \in \mathcal{B}$, $P_n(B) = 0$
for all $n$ and $P(B) > 0$ for some $P \in \mathcal{P}$. If $P(B \setminus D) > 0$, then some
kernel is included in $B \setminus D$, contradicting the choice of $D$. So we can
assume $B \subset D$ and for some $n$, $P(B \cap K_n) > 0$. Then $P_n(B \cap K_n) > 0$ since
$\mu(B \cap K_n) > 0$ and $K_n$ is a kernel for $P_n$. This contradiction completes the
proof.


Lemma 5.3 Any countable family of σ-finite measures $\mu_n$ is equivalent to one
finite measure $\mu$.


Proof. First, as in the last proof, we can assume $\mu_n(S) \le 1$ for all $n$. Then, let
$\mu := \sum_{n \ge 1} \mu_n/2^n$, which is finite and equivalent to $\{\mu_n\}_{n \ge 1}$.


The next step in the proof of Theorem 5.1 is: 


Theorem 5.4 Under the hypotheses of Theorem 5.1, the following are equivalent:


(a) $\mathcal{D}$ is sufficient for $\mathcal{P}$;
(b) There is a probability measure $\lambda$ such that $\{\lambda\}$ is equivalent to $\mathcal{P}$ and $dP/d\lambda$
can be chosen to be $\mathcal{D}$-measurable for all $P \in \mathcal{P}$;

(c) There is a probability measure $\lambda$ which dominates $\mathcal{P}$ such that $dP/d\lambda$ can
be chosen to be $\mathcal{D}$-measurable for all $P \in \mathcal{P}$.


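Proof. First it will be shown that (c) implies (a). For any $B \in \mathcal{B}$ let
$f_B := E_\lambda(1_B \mid \mathcal{D})$. Then for any $P \in \mathcal{P}$ and $A \in \mathcal{D}$, by RAP, Theorem 10.1.9,

$$\int_A f_B\, dP = \int_A E_\lambda(1_B \mid \mathcal{D})\frac{dP}{d\lambda}\, d\lambda = \int_A 1_B \frac{dP}{d\lambda}\, d\lambda = P(A \cap B),$$

so $f_B = E_P(1_B \mid \mathcal{D})$ and $\mathcal{D}$ is sufficient.

(a) implies (b): if $\mathcal{D}$ is sufficient, take a sequence $\{P_n\} \subset \mathcal{P}$, equivalent to $\mathcal{P}$,
by Lemma 5.2. Let $\lambda := \sum_{n \ge 1} P_n/2^n$. For each $B \in \mathcal{B}$ take $f_B = E_P(1_B \mid \mathcal{D})$
$P$-a.s. for all $P \in \mathcal{P}$. Since $0 \le f_B \le 1$ $P$-almost surely for all $P$ in $\mathcal{P}$, also
$0 \le f_B \le 1$ almost surely for $\lambda$. Since $P_n(B \cap A) = \int_A f_B\, dP_n$ for all $n$ and all
$A \in \mathcal{D}$, it follows that $\lambda(B \cap A) = \int_A f_B\, d\lambda$, so $f_B = E_\lambda(1_B \mid \mathcal{D})$.

To show that for all $P \in \mathcal{P}$, $dP/d\lambda$ is $\mathcal{D}$-measurable, note that for each $B$
in $\mathcal{B}$,

$$\int_B \frac{dP}{d\lambda}\, d\lambda = P(B) = \int f_B\, dP = \int E_\lambda(1_B \mid \mathcal{D})\frac{dP}{d\lambda}\, d\lambda,$$

which by RAP, Theorem 10.1.9 again, equals

$$\int E_\lambda(1_B \mid \mathcal{D})\, E_\lambda\Bigl(\frac{dP}{d\lambda} \Bigm| \mathcal{D}\Bigr) d\lambda = \int 1_B\, E_\lambda\Bigl(\frac{dP}{d\lambda} \Bigm| \mathcal{D}\Bigr) d\lambda = \int_B E_\lambda\Bigl(\frac{dP}{d\lambda} \Bigm| \mathcal{D}\Bigr) d\lambda.$$

Since $dP/d\lambda$ and $E_\lambda(dP/d\lambda \mid \mathcal{D})$ have the same integral over all sets in $\mathcal{B}$, they
must be equal $\lambda$-a.s., so that $dP/d\lambda$ is equal $\lambda$-a.s. to a $\mathcal{D}$-measurable function.
So (a) implies (b). Clearly (b) implies (c), so Theorem 5.4 is proved.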


Proof of the factorization theorem 5.1. If $\mathcal{D}$ is sufficient, take $\lambda$ from the last
theorem and let $h := d\lambda/d\mu$. Letting $f_P := dP/d\lambda$ we have the desired
factorization, with $h \in \mathcal{L}^1(S, \mathcal{B}, \mu)$ (in fact $\int h\, d\mu = 1$).

Conversely, if factorization holds (with $h$ not necessarily integrable), then
by Lemma 5.2, take $\{P_n\}_{n \ge 1}$ equivalent to $\mathcal{P}$, and $\lambda := \sum_{n \ge 1} P_n/2^n$, so that
$\{\lambda\}$ is equivalent to $\mathcal{P}$ by Lemma 5.3. Then $\lambda$ is absolutely continuous with
respect to $\mu$ and $d\lambda/d\mu = hk$ where $k(x) := \sum_{n \ge 1} f_{P_n}(x)/2^n$ for all $x$. Also,
$k$ is $\mathcal{D}$-measurable. For each $P \in \mathcal{P}$ let $g_P(x) := f_P(x)/k(x)$ if $k(x) > 0$
and $g_P(x) := 0$ otherwise. Then $\lambda(k = 0) = 0$, and for each $P \in \mathcal{P}$, $g_P$ is
$\mathcal{D}$-measurable, $P$ is absolutely continuous with respect to $\lambda$, and $dP/d\lambda = g_P$.
So $\mathcal{D}$ is sufficient for $\mathcal{P}$ by Theorem 5.4.
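
A standard dominated example, added here for illustration and not taken from the text, is i.i.d. Bernoulli($p$) sampling on $\{0,1\}^n$ with counting measure as $\mu$. The density $p^{T(x)}(1-p)^{n-T(x)}$ with $T(x) := \sum_i x_i$ factors as $f_P \circ T$ times $h \equiv 1$, so $T$ should be sufficient. The sketch below checks definition (5.1) directly: the conditional probability of a point given $T$ does not depend on $p$.

```python
from fractions import Fraction
from itertools import product


def conditional_given_sum(x, p):
    """P^n({x} | T = sum(x)) for i.i.d. Bernoulli(p) coordinates, by enumeration."""
    n, s = len(x), sum(x)

    def prob(y):
        t = sum(y)
        return p ** t * (1 - p) ** (n - t)

    # Total probability of the event {T = s}.
    total = sum(prob(y) for y in product((0, 1), repeat=n) if sum(y) == s)
    return prob(x) / total
```

For any `p` strictly between 0 and 1, the value for a sequence with $s$ ones among $n$ coordinates should be $1/\binom{n}{s}$.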




Given a statistic $T$, i.e. a measurable function, from $S$ into $Y$ for measurable
spaces $(S, \mathcal{B})$ and $(Y, \mathcal{F})$, let $\mathcal{D} := T^{-1}(\mathcal{F}) := \{T^{-1}(A) : A \in \mathcal{F}\}$, a
σ-algebra. For a family $\mathcal{Q}$ of laws on $(S, \mathcal{B})$, $T$ is called a sufficient statistic for
$\mathcal{Q}$ iff $\mathcal{D}$ is sufficient for $\mathcal{Q}$. If $T$ is sufficient we can write $f_P = g_P \circ T$ for
some $\mathcal{F}$-measurable function $g_P$ by RAP, Theorem 4.2.8. Sufficiency, defined
in terms of conditional probabilities of measurable sets, can be extended to
suitable conditional expectations:


Theorem 5.5 Let $\mathcal{D}$ be sufficient for a family $\mathcal{P}$ of laws on a measurable
space $(S, \mathcal{B})$. Then for any measurable real-valued function $f$ on $(S, \mathcal{B})$ which
is integrable for each $P \in \mathcal{P}$, there is a $\mathcal{D}$-measurable function $g$ such that
$g = E_P(f \mid \mathcal{D})$ a.s. for all $P \in \mathcal{P}$.


Proof. When $f$ is the indicator function of a set in $\mathcal{B}$, the assertion is the
definition of sufficiency. It then follows for any simple function, which is
a finite linear combination of such indicators. If $f$ is nonnegative, there is a
sequence of nonnegative simple functions increasing up to $f$ and the conclusion
holds (RAP, Proposition 4.1.5 and Theorem 10.1.7). Then any $f$ satisfying the
hypothesis can be written as $f = f^+ - f^-$ for $f^+$ and $f^-$ nonnegative and
the result follows.


Let $\mu$ and $\nu$ be two probability measures on the same measurable space
$(V, \mathcal{U})$. Take the Lebesgue decomposition (RAP, Theorem 5.5.3) $\nu = \nu_{ac} + \nu_s$
where $\nu_{ac}$ is absolutely continuous, and $\nu_s$ is singular, with respect to $\mu$. Let
$A \in \mathcal{U}$ with $\nu_s(A) = \mu(V \setminus A) = 0$, so $\nu_{ac}(V \setminus A) = 0$. Then the likelihood
ratio $R_{\nu/\mu}$ is defined as the Radon–Nikodym derivative $d\nu_{ac}/d\mu$ on $A$ and
$+\infty$ on $V \setminus A$. By uniqueness of the Hahn decomposition of $V$ for $\nu_s - \mu$
(RAP, Theorem 5.6.1), $R_{\nu/\mu}$ is defined up to equality $(\mu + \nu_s)$- and so $(\mu + \nu)$-
almost everywhere.


Theorem 5.6 For any family $\mathcal{P}$ of laws on a measurable space $(S, \mathcal{B})$ and
sub-σ-algebra $\mathcal{D} \subset \mathcal{B}$, if $\mathcal{D}$ is sufficient for $\mathcal{P}$, then for all $P, Q \in \mathcal{P}$, $R_{Q/P}$
can be taken to be $\mathcal{D}$-measurable, i.e., is equal $(P + Q)$-almost everywhere to
a $\mathcal{D}$-measurable function.


Proof. Suppose $\mathcal{D}$ is sufficient for $\mathcal{P}$. Then it is also sufficient for $\{P, Q\}$,
which is dominated by $\mu := P + Q$. So by factorization (Theorem 5.1) there
are $\mathcal{D}$-measurable functions $f_P$ and $f_Q$ and a $\mathcal{B}$-measurable function $h$ such
that $dP/d\mu = f_P h$ and $dQ/d\mu = f_Q h$. Then $R_{Q/P} = f_Q h/(f_P h) = f_Q/f_P$,
where $y/0 := +\infty$ if $y > 0$ and 0 if $y = 0$, is $\mathcal{D}$-measurable (note that $R_{Q/P}$
does not depend on the choice of dominating measure).


Suppose we observe $X_1, \dots, X_n$ i.i.d. with law $P$ or $Q$ but we do not know
which and want to decide. Suppose we have no a priori reason to favor a choice
of $P$ or $Q$, only the data. Then it is natural to evaluate the likelihood ratio $R_{Q^n/P^n}$
and choose $Q$ if $R_{Q^n/P^n} > 1$ and $P$ if $R_{Q^n/P^n} < 1$, while if $R_{Q^n/P^n} = 1$ we still
have no basis to prefer $P$ or $Q$. More generally, decisions between $P$ and $Q$
can be made optimally in terms of minimizing error probabilities or expected
losses by way of the likelihood ratio $R_{Q/P}$ or $R_{Q^n/P^n}$ as appropriate (the
Neyman–Pearson Lemma, Lehmann 1991, pp. 74, 125; Bickel and Doksum
2001, Theorem 4.2.1). By Theorem 5.6, if $\mathcal{B}$ is sufficient for $\mathcal{P}^n$ for some
$\mathcal{P} \supset \{P, Q\}$, then $R_{Q^n/P^n}$ is $\mathcal{B}$-measurable. Specifically, if $T$ is a sufficient
statistic, then by Theorem 5.6 and RAP, Theorem 4.2.8, $R_{Q^n/P^n}$ is a measurable
function of $T$. Thus, no information in $(X_1, \dots, X_n)$ beyond $T$ is helpful in
choosing $P$ or $Q$. In this sense, the definition of sufficiency fits with the informal
notion of sufficiency given at the beginning of the section.
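
As an illustration (an added example, not from the text), take $P$ and $Q$ to be Bernoulli laws on $\{0,1\}$ with success probabilities $p$ and $q$, and $T(x) := \sum_i x_i$. The sketch below computes the likelihood ratio of $Q^n$ to $P^n$ coordinate by coordinate, checks that it takes a single value on each level set of $T$, and applies the decision rule described above. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def likelihood_ratio(x, p, q):
    """Likelihood ratio of Q^n to P^n at x for Bernoulli(q) versus Bernoulli(p)."""
    ratio = Fraction(1)
    for xi in x:
        ratio *= (q if xi == 1 else 1 - q) / (p if xi == 1 else 1 - p)
    return ratio


def is_function_of_sum(n, p, q) -> bool:
    """True iff the likelihood ratio is constant on each set {x : sum(x) = t}."""
    values = {}
    for x in product((0, 1), repeat=n):
        values.setdefault(sum(x), set()).add(likelihood_ratio(x, p, q))
    return all(len(v) == 1 for v in values.values())


def decide(x, p, q) -> str:
    """Choose Q if the ratio exceeds 1, P if it is below 1, otherwise report a tie."""
    r = likelihood_ratio(x, p, q)
    return "Q" if r > 1 else "P" if r < 1 else "tie"
```

Here `p` and `q` are assumed to be `Fraction` values strictly between 0 and 1, so no division by zero occurs.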

It will be shown that empirical measures are sufficient in a sense to be
defined. Let $\mathcal{S}_n$ be the sub-σ-algebra of $\mathcal{A}^n$ consisting of sets invariant under
all permutations of the coordinates.


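Theorem 5.7 $\mathcal{S}_n$ is sufficient for $\mathcal{P}^n := \{P^n : P \in \mathcal{P}\}$ where $\mathcal{P}$ is the set of
all laws on $(X, \mathcal{A})$.


Proof. Let $S_n$ be the symmetric group of all permutations of $\{1, 2, \dots, n\}$.
For each $\Pi \in S_n$ and $x := (x_1, \dots, x_n) \in X^n$, set $f_\Pi(x) := (x_{\Pi(1)}, \dots, x_{\Pi(n)})$.
Then $f_\Pi$ is a 1-1 measurable transformation of $X^n$ onto itself with measurable
inverse and preserves the product law $P^n$ for each law $P$ on $(X, \mathcal{A})$. For any
$C \in \mathcal{A}^n$, we have

$$P^n(C \mid \mathcal{S}_n) = \frac{1}{n!}\sum_{\Pi \in S_n} 1_{f_\Pi(C)}$$

almost surely for $P^n$ since for any $B \in \mathcal{S}_n$,

$$P^n(B \cap C) = \frac{1}{n!}\sum_{\Pi \in S_n} P^n\bigl(f_\Pi(B \cap C)\bigr) = \frac{1}{n!}\sum_{\Pi \in S_n} P^n\bigl(f_\Pi(C) \cap B\bigr).$$

The conclusion follows.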
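
The formula in this proof can be tested on a small finite space. The sketch below assumes a law on the hypothetical three-point space $\{0, 1, 2\}$ and uses exact arithmetic. It evaluates $g(x) := \frac{1}{n!}\sum_{\Pi} 1_{f_\Pi(C)}(x)$ and checks the defining property of conditional probability, $\int_B g\, dP^n = P^n(B \cap C)$, for a permutation-invariant set $B$.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def permute(x, perm):
    """f_perm(x) = (x[perm[0]], ..., x[perm[n-1]])."""
    return tuple(x[i] for i in perm)


def symmetrized_indicator(C, x):
    """(1/n!) times the number of permutations perm with x in f_perm(C)."""
    n = len(x)
    hits = sum(
        1
        for perm in permutations(range(n))
        if x in {permute(y, perm) for y in C}
    )
    return Fraction(hits, factorial(n))


def product_law(x, law):
    """Probability of the point x under the product of the one-dimensional law."""
    prob = Fraction(1)
    for xi in x:
        prob *= law[xi]
    return prob


def defining_identity_holds(law, C, B) -> bool:
    """Check sum over x in B of g(x) P^n(x) == P^n(B & C); B must be symmetric."""
    lhs = sum(symmetrized_indicator(C, x) * product_law(x, law) for x in B)
    rhs = sum(product_law(x, law) for x in B & C)
    return lhs == rhs
```

A usage example: with `law = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}`, take `B` to be the set of triples with a given coordinate sum and `C` any set of triples.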


For example, if $X$ has just two points, say $X = \{0, 1\}$, and $S := \sum_{i=1}^n x_i$,
then $\mathcal{S}_n$ is the smallest σ-algebra for which $S$ is measurable. In this case no
σ-algebra strictly smaller than $\mathcal{S}_n$ is sufficient ($\mathcal{S}_n$ is “minimal sufficient”).

For each $B \in \mathcal{A}$ and $x = (x_1, \dots, x_n) \in X^n$, let

$$P_n(B)(x) := \frac{1}{n}\sum_{j=1}^n \delta_{x_j}(B).$$

So $P_n$ is the usual empirical measure, except that in this section, $x \mapsto P_n(B)(x)$
is a measurable function, or statistic, on a measurable space, rather than a
probability space, since no particular law $P$ or $P^n$ has been specified as yet.
Here, $P_n(B)(x)$ is just a function of $B$ and $x$.
For a collection $\mathcal{F}$ of measurable functions on $(X^n, \mathcal{A}^n)$, let $\mathcal{S}_{\mathcal{F}}$ be the
smallest σ-algebra making all functions in $\mathcal{F}$ measurable. Then $\mathcal{F}$ will be
called sufficient if and only if $\mathcal{S}_{\mathcal{F}}$ is sufficient.


Theorem 5.8 For any measurable space $(X, \mathcal{A})$ and for each $n = 1, 2, \ldots$,
the empirical measure $P_n$ is sufficient for $\mathcal{P}^n$ where $\mathcal{P}$ is the set of all laws on
$(X, \mathcal{A})$. In other words the set $\mathcal{F}$ of functions $x \mapsto P_n(B)(x)$, for all
$B \in \mathcal{A}$, is sufficient. In fact the $\sigma$-algebra $\mathcal{S}_{\mathcal{F}}$ is exactly $\mathcal{S}_n$.


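Proof. Clearly $\mathcal{S}_{\mathcal{F}} \subset \mathcal{S}_n$. To prove the converse inclusion, for each set
$B \in \mathcal{A}^n$ let $S(B) := \bigcup_{\Pi \in S_n} f_\Pi(B) \in \mathcal{S}_n$. Then if
$B \in \mathcal{S}_n$, $S(B) = B$. Let $\mathcal{E} := \{C \in \mathcal{A}^n : S(C) \in \mathcal{S}_{\mathcal{F}}\}$.
We want to prove $\mathcal{E} = \mathcal{A}^n$.

Now $\mathcal{E}$ is a monotone class: if $C_k \in \mathcal{E}$ and $C_k \uparrow C$ or $C_k \downarrow C$,
then $C \in \mathcal{E}$. Also, since $S(\cdot)$ commutes with finite unions, any finite union of
sets in $\mathcal{E}$ is in $\mathcal{E}$. So it will be enough to prove

$$A_1 \times \cdots \times A_n \in \mathcal{E} \quad \text{for any } A_j \in \mathcal{A},\ j = 1, \ldots, n, \tag{5.2}$$

since the collection $\mathcal{C}$ of finite unions of such sets, which can be taken to be
disjoint, is an algebra and the smallest monotone class including $\mathcal{C}$ is $\mathcal{A}^n$
(RAP, Propositions 3.2.2 and 3.2.3 and Theorem 4.4.2). Here by another
finite union the $A_i$ can be replaced by $B_{j(i)}$ where $B_1, \ldots, B_r$ are atoms
of the algebra generated by $A_1, \ldots, A_n$, so $r \le 2^n$. So we just need to
show that for all $j(1), \ldots, j(n)$ with $1 \le j(i) \le r$, $i = 1, \ldots, n$, we have
$S(B_{j(1)} \times \cdots \times B_{j(n)}) \in \mathcal{S}_{\mathcal{F}}$. Now,
$x \in S(B_{j(1)} \times \cdots \times B_{j(n)})$ if and only if for each $i = 1, \ldots, r$,
$P_n(B_i) = k_i/n$ where $k_i$ is the number of values of $s$ such that $j(s) = i$,
$s = 1, \ldots, n$. So $\mathcal{S}_{\mathcal{F}} = \mathcal{S}_n$, and by Theorem 5.7, the conclusion
follows.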


For some subclasses $\mathcal{C} \subset \mathcal{A}$, the restriction of $P_n$ to $\mathcal{C}$ may be sufficient, and
handier than the values of $P_n$ on the whole $\sigma$-algebra $\mathcal{A}$. Recall that a class $\mathcal{C}$
included in a $\sigma$-algebra $\mathcal{A}$ is called a determining class if any two measures
on $\mathcal{A}$, equal and finite on $\mathcal{C}$, are equal on all of $\mathcal{A}$. If $\mathcal{C}$ generates the
$\sigma$-algebra $\mathcal{A}$, $\mathcal{C}$ is not necessarily a determining class unless, for example, it is
an algebra (RAP, Theorem 3.2.7 and the example after it).

Sufficiency of $P_n(A)$ for $A \in \mathcal{C}$ can depend on $n$. Let $X = \{1, 2, 3, 4, 5\}$,
$\mathcal{A} = 2^X$, and $\mathcal{C} = \{\{1, 2, 3\}, \{2, 3, 4\}, \{3, 4, 5\}\}$. Then $\mathcal{C}$ is
sufficient for $n = 1$, but not for $n = 2$ since, for example,
$(\delta_1 + \delta_4)/2 = (\delta_2 + \delta_5)/2$ on $\mathcal{C}$. This is a case where $\mathcal{C}$
generates $\mathcal{A}$ but is not a determining class.
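
The five-point example above can be checked by direct enumeration. The sketch below
lists, for each unordered sample of size $n$, the values of the empirical measure on
the three sets of $\mathcal{C}$, and tests whether different samples always give different
value vectors. The helper names `profile` and `separates` are hypothetical, and the
check only illustrates the quoted identity; it is not a general test of sufficiency.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

X = [1, 2, 3, 4, 5]
C = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]


def profile(sample):
    """Values of the empirical measure P_n on each set of C."""
    n = len(sample)
    return tuple(Fraction(sum(x in A for x in sample), n) for A in C)


def separates(n):
    """True if distinct unordered samples of size n have distinct profiles."""
    samples = list(combinations_with_replacement(X, n))
    return len({profile(s) for s in samples}) == len(samples)


print(separates(1), separates(2))
print(profile((1, 4)), profile((2, 5)))
```

For $n = 2$ the samples $(1, 4)$ and $(2, 5)$ both give the value $1/2$ on every set
of $\mathcal{C}$, which is the coincidence stated in the text.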


Theorem 5.9 Let $(X, d)$ be a separable metric space which is a Borel subset
of its completion, with Borel $\sigma$-algebra $\mathcal{A}$. Suppose $\mathcal{C} = \{C_k\}_{k=1}^\infty$ is a
countable determining class for $\mathcal{A}$. Then for each $n = 1, 2, \ldots$, the sequence
$\{P_n(C_k)\}_{k=1}^\infty$ is sufficient for the class $\mathcal{P}^n$ of all laws $P^n$ on
$(X^n, \mathcal{A}^n)$ where $P \in \mathcal{P}$, the class of all laws on $(X, \mathcal{A})$.

Proof. For $n = 1, 2, \ldots$, let $I_n$ be the finite set $\{j/n : j = 0, 1, \ldots, n\}$. Let
$I_n^\infty$ be a countable product of copies of $I_n$ with the product $\sigma$-algebra defined
by the $\sigma$-algebra of all subsets of $I_n$. We have:


Lemma 5.10 Under the hypotheses of Theorem 5.9, for each $n = 1, 2, \ldots$,
and $A \in \mathcal{A}$, there is a Borel measurable function $f_A$ on $I_n^\infty$ such that for any
$x_1, \ldots, x_n \in X$ and $P_n := n^{-1} \sum_{i=1}^n \delta_{x_i}$, we have
$P_n(A) = f_A(\{P_n(C_k)\}_{k=1}^\infty)$.


Proof. Since $\mathcal{C}$ is a determining class, the function $f_A$ exists. We need to show
it is measurable.

If $X$ is uncountable, then by the Borel isomorphism theorem (RAP, Theorem
13.1.1) we can assume $X = [0, 1]$. Or, if $X$ is countable, then $\mathcal{A}$ is the
$\sigma$-algebra of all its subsets, and we can assume $X = \{0\} \cup Y$ where
$Y \subset \{1/k\}_{k \ge 1}$. Then $X$ is always complete and included in $[0, 1]$. Let

$$X^{(n)} := \{x := \{x_j\}_{j=1}^n \in X^n : x_1 \le x_2 \le \cdots \le x_n\}.$$


The map $x \mapsto P_n^{(x)}$ is 1-1 from $X^{(n)}$ into the set of all possible empirical
measures $P_n$, since if $Q := P_n^{(x)} = P_n^{(y)}$, with $x, y \in X^{(n)}$, then
$x_1 = y_1 =$ the smallest $u$ such that $Q(\{u\}) > 0$. Next, $x_2 = y_2 = x_1$ if and only
if $Q(\{x_1\}) \ge 2/n$, while otherwise $x_2 = y_2 =$ the next-smallest $v$ such that
$Q(\{v\}) > 0$, and so on.
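
The recovery of an ordered sample from its empirical measure, as just described, can
be sketched in Python for real-valued data. The names `empirical_atoms` and
`ordered_sample` are hypothetical, and the empirical measure is represented simply as
a dictionary from atoms to their masses.

```python
from collections import Counter
from fractions import Fraction


def empirical_atoms(sample):
    """Empirical measure of a finite sample as {atom: mass}."""
    n = len(sample)
    return {u: Fraction(k, n) for u, k in Counter(sample).items()}


def ordered_sample(Q, n):
    """Rebuild the nondecreasing sample of size n whose empirical measure is Q."""
    out = []
    for u in sorted(Q):                  # smallest atom first, then the next-smallest
        out.extend([u] * int(Q[u] * n))  # atom u occurs n * Q({u}) times
    return tuple(out)


x = (0.9, 0.2, 0.5, 0.2)                 # hypothetical unordered sample
Q = empirical_atoms(x)
print(ordered_sample(Q, len(x)))
```

Any rearrangement of `x` yields the same `Q` and hence the same ordered tuple, which
is the one-to-one correspondence used in the proof.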

Now, $X^{(n)}$ is a Borel subset of $X^n$. The completion of $X^n$ for any of the usual
product metrics is isometric to $S^n$ where $S$ is the completion of $X$. Clearly $X^n$
is a Borel subset of $S^n$. It follows that $X^{(n)}$ is a Borel subset of its completion.
Here the following will be useful:


Lemma 5.11 On $I_n^\infty$, the product $\sigma$-algebra $\mathcal{B}^\infty$, the smallest $\sigma$-algebra for
which all the coordinates are measurable, equals the Borel $\sigma$-algebra $\mathcal{B}_{\mathcal{T}}$ of
the product topology $\mathcal{T}$.


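Proof. Clearly $\mathcal{B}^\infty \subset \mathcal{B}_{\mathcal{T}}$ since the coordinates are continuous. Conversely,
$\mathcal{T}$ has a base $\mathcal{R}$ consisting of all sets $\prod_{j=1}^\infty A_j$ where $A_j = I_n$ for all but
finitely many $j$; $\mathcal{R}$ is countable (since $I_n$ is finite) and consists of sets in
$\mathcal{B}^\infty$, and every $U \in \mathcal{T}$ is a countable union of sets in $\mathcal{R}$, so
$U \in \mathcal{B}^\infty$ and $\mathcal{B}_{\mathcal{T}} = \mathcal{B}^\infty$, so Lemma 5.11 is proved.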


Now to continue the proof of Lemma 5.10, the map
$f: x \mapsto \{P_n^{(x)}(C_k)\}_{k=1}^\infty$ is Borel measurable from $X^{(n)}$ into $I_n^\infty$. Since
$\{C_k\}_{k=1}^\infty$ are a determining class, $f$ is one-to-one. Thus by Appendix G,
Theorem G.6, $f$ has Borel image $f[X^{(n)}]$ in $I_n^\infty$ and $f^{-1}$ is Borel measurable
from $f[X^{(n)}]$ onto $X^{(n)}$. Then $f^{-1}$ extends to a Borel measurable function $h$ on
all of $I_n^\infty$ into $X^{(n)}$ since $X^{(n)}$ is Borel-isomorphic to $\mathbb{R}$, or a countable
subset, with Borel $\sigma$-algebra, by the Borel isomorphism theorem (RAP, Theorem
13.1.1 again), and thus the extension works as for real-valued functions (RAP,
Theorem 4.2.5). For any $A \in \mathcal{A}$, $g_A: x \mapsto P_n^{(x)}(A)$ is Borel measurable. Thus
$f_A = g_A \circ h$ is Borel measurable, and Lemma 5.10 is proved.


Now to prove Theorem 5.9, we know from Theorem 5.8 that the smallest
$\sigma$-algebra $\mathcal{S}_{\mathcal{F}}$ making all functions $x \mapsto P_n(B)(x)$ measurable for
$B \in \mathcal{A}$ is sufficient. By Lemma 5.10, $\mathcal{S}_{\mathcal{F}}$ is the same as the smallest
$\sigma$-algebra making $x \mapsto P_n(C_k)$ measurable for all $k$, which finishes the proof.


In the real line $\mathbb{R}$, the closed half-lines $(-\infty, x]$ form a determining class. In
other words, as is well known, a probability measure $P$ on the Borel $\sigma$-algebra
of $\mathbb{R}$ is uniquely determined by its distribution function $F$ (RAP, Theorem
3.2.6). It follows that the half-lines $(-\infty, q]$ for $q$ rational are a determining
class: for any real $x$, take rational $q_k \downarrow x$, then $F(q_k) \downarrow F(x)$. Thus we have:


Corollary 5.12 In $\mathbb{R}$, the empirical distribution functions defined by
$F_n(x) := P_n((-\infty, x])$ for all $x$ are sufficient for the family $\mathcal{P}^n$ of all laws
$P^n$ on $\mathbb{R}^n$ where $P$ varies over all laws on the Borel $\sigma$-algebra in $\mathbb{R}$.
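
For a concrete view of the statistic in Corollary 5.12, here is a minimal Python
sketch of the empirical distribution function $F_n(x) = P_n((-\infty, x])$. The
function name `empirical_cdf` and the sample values are hypothetical. The sketch also
checks that permuting the sample leaves $F_n$ unchanged at a few test points.

```python
from bisect import bisect_right
from fractions import Fraction


def empirical_cdf(sample):
    """Return F_n with F_n(x) = P_n((-inf, x]) for the given finite sample."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(x):
        # bisect_right counts the sample points that are <= x
        return Fraction(bisect_right(xs, x), n)

    return F_n


F = empirical_cdf([0.5, 0.2, 0.9, 0.2])  # hypothetical sample
G = empirical_cdf([0.2, 0.9, 0.2, 0.5])  # the same points in another order
print([F(q) for q in (0.1, 0.2, 0.5, 1.0)])
```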


5.2 Admissibility 


Let $\mathcal{F}$ be a family of real-valued functions on a set $S$, measurable for a
$\sigma$-algebra $\mathcal{B}$ on $S$. Then there is a natural function, called here the evaluation
map, $\mathcal{F} \times S \to \mathbb{R}$ given by $(f, x) \mapsto f(x)$. It turns out that for general
$\mathcal{F}$ there may not exist any $\sigma$-algebra of subsets of $\mathcal{F}$ for which the evaluation
map is jointly measurable. This section is about the possible existence of such a
$\sigma$-algebra and its uses.

Let $(S, \mathcal{B})$ be a measurable space. Then $(S, \mathcal{B})$ will be called separable if
$\mathcal{B}$ is generated by some countable subclass $\mathcal{C} \subset \mathcal{B}$ and $\mathcal{B}$ contains all
singletons $\{x\}$, $x \in S$. In this section $(S, \mathcal{B})$ will be assumed to be such a
space. Let $\mathcal{F}$ be a collection of real-valued functions on $S$. (The following
definition is unrelated to the usage of "admissible" for estimators in statistics.)


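Definition. $\mathcal{F}$ is called admissible iff there is a $\sigma$-algebra $\mathcal{T}$ of subsets of
$\mathcal{F}$ such that the evaluation map $(f, x) \mapsto f(x)$ is jointly measurable from
$(\mathcal{F}, \mathcal{T}) \times (S, \mathcal{B})$ (with product $\sigma$-algebra) to $\mathbb{R}$ with Borel sets. Then
$\mathcal{T}$ will be called an admissible structure for $\mathcal{F}$.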

$\mathcal{F}$ will be called image admissible via $(Y, \mathcal{S}, T)$ if $(Y, \mathcal{S})$ is a measurable
space and $T$ is a function from $Y$ onto $\mathcal{F}$ such that the map $(y, x) \mapsto T(y)(x)$
is jointly measurable from $(Y, \mathcal{S}) \times (S, \mathcal{B})$ with product $\sigma$-algebra to $\mathbb{R}$
with Borel sets.

To apply these definitions to a family $\mathcal{C}$ of sets let $\mathcal{F} = \{1_A : A \in \mathcal{C}\}$.


Remarks. There is no assumption of separability of measurable spaces in the
definition just given. In the next three theorems, $(S, \mathcal{B})$ is assumed separable,
but still there is no restriction on the $\sigma$-algebras $\mathcal{T}$ on $\mathcal{F}$ or $\mathcal{S}$ on $Y$.

If $\mathcal{G} \subset \mathcal{F}$ with $\mathcal{F}$ admissible, then clearly so is $\mathcal{G}$, with the relative
$\sigma$-algebra $\mathcal{G} \sqcap \mathcal{T} := \{\mathcal{G} \cap A : A \in \mathcal{T}\}$.

For an example, let $(K, d)$ be a compact metric space and let $\mathcal{F}$ be a set
of continuous real-valued functions on $K$, compact for the supremum norm.
Then the functions in $\mathcal{F}$ are uniformly equicontinuous on $K$ by the
Arzelà-Ascoli theorem (RAP, Theorem 2.4.7). It follows that $(f, x) \mapsto f(x)$ is
jointly continuous for the supremum norm on $f \in \mathcal{F}$ and $d$ on $K$. Since both
spaces are separable metric spaces, the map is also jointly measurable, so that
$\mathcal{F}$ is admissible.

If a family $\mathcal{F}$ is admissible, then it is image admissible, taking $T$ to be the
identity. In regard to the converse direction here is an example. Let $S = [0, 1]$
with usual Borel $\sigma$-algebra $\mathcal{B}$. Let $(Y, \mathcal{S})$ be a countable product of copies of
$(S, \mathcal{B})$. For $y = \{y_n\}_{n=1}^\infty \in Y$ let $T(y)(x) := 1_J(x, y)$ where
$J := \{(x, y) : x = y_n \text{ for some } n\}$. Let $\mathcal{C}$ be the class of all countable subsets
of $S$ and $\mathcal{F}$ the class of indicator functions of sets in $\mathcal{C}$. Then it is easy to
check that $\mathcal{F}$ is image admissible via $(Y, \mathcal{S}, T)$. If a $\sigma$-algebra $\mathcal{T}$ is defined
on $\mathcal{F}$ by setting $\mathcal{T} := \{F \subset \mathcal{F} : T^{-1}(F) \in \mathcal{S}\}$, then $\mathcal{T}$ is not countably
generated (see Problem 5(b)) although $\mathcal{S}$ is. This example shows how sometimes
image admissibility may work better than admissibility.


Theorem 5.13 For any separable measurable space $(S, \mathcal{B})$, there is a subset
$Y$ of $[0, 1]$ and a 1-1 function $M$ from $S$ onto $Y$ which is a measurable
isomorphism (is measurable and has measurable inverse) for the Borel
$\sigma$-algebra on $Y$.


Remarks. Note that $Y$ is not necessarily a measurable subset of $[0, 1]$. On the
other hand if $(S, \mathcal{B})$ is given as a separable metric space which is a Borel subset
of its completion, with Borel $\sigma$-algebra, then $(S, \mathcal{B})$ is measurably isomorphic
either to a countable set, with the $\sigma$-algebra of all its subsets, or to all of $[0, 1]$
by the Borel isomorphism theorem (RAP, Theorem 13.1.1).


Proof. Recall that the Borel subsets of Y as a metric space with usual metric 
are the same as the intersections with Y of Borel sets in [0, 1], since the same 
is true for open sets. 

Let $\mathcal{C} := \{C_j\}_{j \ge 1}$ be a countable set of generators of $\mathcal{B}$. Consider the map
$f: x \mapsto \{1_{C_j}(x)\}_{j \ge 1}$ from $S$ into a countable product $2^\infty$ of copies of
$\{0, 1\}$ with product $\sigma$-algebra. Then $f$ is 1-1 and onto its range $Z$. Thus it
preserves all set operations, specifically countable unions and complements. So it
is easily seen that $f$ is a measurable isomorphism of $S$ onto $Z$.

Next consider the map $g: \{z_j\} \mapsto \sum_{j=1}^\infty 2z_j/3^j$ from $2^\infty$ into $[0, 1]$,
actually onto the Cantor set $C$ (RAP, proof of Proposition 3.4.1). Then $g$ is
continuous from the compact space $2^\infty$ with product topology onto $C$. It is easily
seen that $g$ is 1-1. Thus $g$ is a homeomorphism (RAP, Theorem 2.2.11) and a
measurable isomorphism for Borel $\sigma$-algebras. The Borel $\sigma$-algebra on $2^\infty$
equals the product $\sigma$-algebra (Lemma 5.11). So the restriction of $g$ to $Z$ is
a measurable isomorphism onto its range $Y$. It follows that the composition
$g \circ f$, called the Marczewski function

$$M(x) := \sum_{n=1}^\infty 2 \cdot 1_{C_n}(x)/3^n,$$

is 1-1 from $S$ onto $Y \subset I := [0, 1]$ and is a measurable isomorphism
onto $Y$.
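
To see the Marczewski function at work, the following Python sketch computes a finite
analogue of $M(x) = \sum_n 2 \cdot 1_{C_n}(x)/3^n$ for a finite list of generating
sets. The space `S`, the sets in `generators`, and the function name `marczewski` are
all hypothetical; a finite sum stands in for the infinite series.

```python
from fractions import Fraction


def marczewski(x, generators):
    """Finite analogue of M(x) = sum_n 2 * 1_{C_n}(x) / 3**n for sets C_1, C_2, ..."""
    return sum(Fraction(2 * (x in C), 3 ** n)
               for n, C in enumerate(generators, start=1))


S = range(8)
# Hypothetical generators: C_n = points of S whose (n-1)-th binary digit is 1.
generators = [{x for x in S if (x >> k) & 1} for k in range(3)]

values = {x: marczewski(x, generators) for x in S}
print(values)
```

Because these generators separate the points of `S`, the eight values are distinct
points of $[0, 1]$ whose ternary expansions use only the digits 0 and 2.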


Let $(S, \mathcal{B})$ be a separable measurable space where $\mathcal{B}$ is generated by
a sequence $\{C_j\}$. By taking the union of the finite algebras generated by
$C_1, \ldots, C_n$ for each $n$, we can and do take $\mathcal{C} := \{C_j\}_{j \ge 1}$ to be an algebra.
Let $\mathcal{F}_0$ be the class of all finite sums $\sum_{j=1}^n c_j 1_{C_j}$ for rational
$c_j \in \mathbb{R}$, and $n = 1, 2, \ldots$. Then "Borel classes" or "Banach classes" are defined
as follows by transfinite recursion (RAP, 1.3.2). Let $(\Omega, <)$ be an uncountable
well-ordered set such that for each $\beta \in \Omega$, $\{\alpha \in \Omega : \alpha < \beta\}$ is countable.
(Specifically, one can take $\Omega$ to be the set of all countable ordinals with their
usual ordering.) For any countable set $A \subset \Omega$, $\{y : y < x \text{ for some } x \in A\}$ is
countable, so there is a $z \in \Omega$ with $x < z$ for all $x \in A$. For each $\alpha \in \Omega$ there is
a next larger element called $\alpha + 1$. Let $0$ be the smallest element of $\Omega$. For each
$\alpha \in \Omega$, given $\mathcal{F}_\alpha$, let $\mathcal{F}_{\alpha+1}$ be the set of all limits of everywhere
pointwise convergent sequences of functions in $\mathcal{F}_\alpha$. If $\beta \in \Omega$ is not of the form
$\alpha + 1$ ($\beta$ is a "limit ordinal"), $\beta > 0$ and $\mathcal{F}_\alpha$ is defined for all $\alpha < \beta$, let
$\mathcal{F}_\beta$ be the union of all $\mathcal{F}_\alpha$ for $\alpha < \beta$. Note that $\mathcal{F}_\alpha \subset \mathcal{F}_\beta$ whenever
$\alpha < \beta$. Let $\mathcal{U} := \bigcup_{\alpha \in \Omega} \mathcal{F}_\alpha$.


Theorem 5.14 For $(S, \mathcal{B})$ separable, $\mathcal{U}$ is the set of all measurable real
functions on $S$.


Proof. Clearly, each function in $\mathcal{U}$ is measurable. Conversely, the class of
all sets $B$ such that $1_B \in \mathcal{U}$ is a monotone class and includes the generating
algebra $\mathcal{A}$, so it is the $\sigma$-algebra $\mathcal{B}$ of all measurable sets (RAP, Theorem 4.4.2).
Likewise, for a fixed $A \in \mathcal{A}$ and constants $c, d$, the collection of all sets $B$ such
that $c1_A + d1_B \in \mathcal{U}$ is $\mathcal{B}$. Then for a fixed $B \in \mathcal{B}$, the set of all $C \in \mathcal{B}$ such
that $c1_C + d1_B \in \mathcal{U}$ is all of $\mathcal{B}$. By a similar proof for a sum of $n$ terms, we get
that all simple functions $\sum_{i=1}^n c_i 1_{B(i)}$ are in $\mathcal{U}$ for any $c_i \in \mathbb{R}$ and
$B(i) \in \mathcal{B}$. Since any measurable real function $f$ is the limit of a sequence of
simple functions $f_n + g_n$ where $f_n \to \max(f, 0)$ and $g_n \to \min(f, 0)$ (RAP,
Proposition 4.1.5), every measurable real function on $S$ is in $\mathcal{U}$.


On admissibility there is the following main theorem: 


Theorem 5.15 (Aumann) Let $I := [0, 1]$ with usual Borel $\sigma$-algebra. Given
a separable measurable space $(S, \mathcal{B})$ and a class $\mathcal{F}$ of measurable real-valued
functions on $S$, the following are equivalent:

(i) $\mathcal{F} \subset \mathcal{F}_\alpha$ for some $\alpha \in \Omega$;

(ii) There is a jointly measurable function $G: I \times S \to \mathbb{R}$ such that for each
$f \in \mathcal{F}$, $f = G(t, \cdot)$ for some $t \in I$;

(iii) There is a separable admissible structure for $\mathcal{F}$;

(iv) $\mathcal{F}$ is admissible;

(v) $2^{\mathcal{F}}$ is an admissible structure for $\mathcal{F}$;

(vi) $\mathcal{F}$ is image admissible via some $(Y, \mathcal{S}, T)$.


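Remarks. The specific classes $\mathcal{F}_\alpha$ depend on the choice of the countable
family $\mathcal{A}$ of generators, but condition (i) does not: if $\mathcal{C}$ is another countable set
of generators of $\mathcal{B}$ with corresponding classes $\mathcal{G}_\alpha$, then for any $\alpha \in \Omega$ there are
$\beta \in \Omega$ and $\gamma \in \Omega$ with $\mathcal{F}_\alpha \subset \mathcal{G}_\beta$ and $\mathcal{G}_\alpha \subset \mathcal{F}_\gamma$.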


Proof. (ii) implies (iii): for each $f \in \mathcal{F}$ choose a unique $t \in I$ and restrict the
Borel $\sigma$-algebra to the set of $t$'s chosen. Then (iii) follows.

Clearly (iii) implies (iv), which is equivalent to (v). 

(iv) implies (iii): note that any real-valued measurable function $G$ (for the
Borel $\sigma$-algebra on the range $\mathbb{R}$ as usual) is always measurable for some
countably generated sub-$\sigma$-algebra, for example, generated by $\{G > q\}$, $q$
rational. Let $G$ be the evaluation map $G(f, t) := f(t)$. The product $\sigma$-algebra
$\mathcal{T} \otimes \mathcal{B}$ is the union of the $\sigma$-algebras generated by countable sets of rectangles
$A_i \times B_i$ for $A_i \in \mathcal{T}$ and $B_i \in \mathcal{B}$. The $\sigma$-algebra generated by countably many
$\sigma$-algebras on $\mathcal{F} \times S$ generated in this way is also generated in the same way.
So the evaluation map $G$ is measurable for such a sub-$\sigma$-algebra. Let $\mathcal{D}$ be the
$\sigma$-algebra of subsets of $\mathcal{F}$ generated by the $A_i$. For any two distinct functions
$f, g$ in $\mathcal{F}$, $f(x) \ne g(x)$ for some $x$. The map $h \mapsto h(x)$ is $\mathcal{D}$ measurable. So
$\{f\}$ is the intersection of those $A_i$ that contain $f$ and the complements of the
others, and (iii) follows. So (iii) through (v) are equivalent.

(iii) implies (ii): by Theorem 5.13, there is a subset $Y \subset I := [0, 1]$ with
Borel $\sigma$-algebra and a measurable isomorphism $t \mapsto G(t, \cdot)$ from $Y$ onto $\mathcal{F}$.
The assumed admissibility implies that $(t, x) \mapsto G(t, x)$ is jointly measurable.
By the general extension theorem for real-valued measurable functions (RAP,
Theorem 4.2.5), although $Y$ is not necessarily measurable, we can assume $G$
is jointly measurable on $I \times S$, proving (ii). So (ii) through (v) are equivalent.

(ii) implies (i): On $[0, 1] \times S$, take generators of the form $A_i \times B_i$, where
$A_i$ are Borel, $B_i \in \mathcal{B}$, and $\{A_i\}_{i \ge 1}$ and $\{B_i\}_{i \ge 1}$ are algebras. Then by
Theorem 5.14, $G$ belongs to some $\mathcal{F}_\alpha$ on $I \times S$. It will be shown by transfinite
induction on $\alpha$ that the sections $G(t, \cdot)$ on $S$ all belong to $\mathcal{F}_\alpha$ for the
generators $\{B_i\}$.

If $G \in \mathcal{F}_0$ on $I \times S$, then $G = \sum_{i=1}^n c_i 1_{A_i \times B_i}$ for some $c_i$, $n$,
Borel $A_i$, and $B_i \in \mathcal{B}$. For each $t \in I$,
$G(t, s) = \sum_{i=1}^n c_i 1_{A_i}(t) 1_{B_i}(s)$, so $G(t, \cdot) \in \mathcal{F}_0$ on $S$. Suppose the
statement holds for a given $\alpha$. Let $G \in \mathcal{F}_{\alpha+1}$ on $I \times S$. Then for some
$G_k \in \mathcal{F}_\alpha$ on $I \times S$, $G_k(t, s) \to G(t, s)$ as $k \to \infty$ for all $t \in I$ and
$s \in S$. For each $t$ and $k$, $G_k(t, \cdot) \in \mathcal{F}_\alpha$ on $S$ by induction assumption. Thus
$G(t, \cdot) \in \mathcal{F}_{\alpha+1}$ on $S$ as desired. Let $\beta$ be a limit ordinal (not a successor
$\alpha + 1$ of any $\alpha$), and suppose the statement holds for each $\alpha < \beta$. Then by
definition of $\mathcal{F}_\beta$, it also holds for $\alpha = \beta$. This completes the proof by
induction, and so (ii) implies (i).

(i) implies (ii): for this we need universal functions, defined as follows. A
jointly measurable function $G: I \times S \to \mathbb{R}$ will be called a universal class $\alpha$
function if every function $f \in \mathcal{F}_\alpha$ on $S$ is of the form $G(t, \cdot)$ for some $t \in I$.
($G$ itself will not necessarily be of class $\alpha$ on $I \times S$.) Recall by the way that an
open, universal open set $U$ in $\mathbb{N}^\infty \times S$ exists for any separable metric space $S$
(RAP, Proposition 13.2.3), where "universal open" means that for every open
set $V \subset S$, there is an $x \in \mathbb{N}^\infty$ such that $V = \{y : (x, y) \in U\}$.


Theorem 5.16 (Lebesgue) For any $\alpha \in \Omega$ there exists a universal class $\alpha$
function $G: I \times S \to \mathbb{R}$.


Proof. For $\alpha = 0$, $\mathcal{F}_0$ is a countable sequence $\{f_k\}_{k \ge 1}$ of functions. Let
$G(1/k, x) := f_k(x)$ and $G(t, x) := 0$ if $t \ne 1/k$ for all $k$. Then $G$ is jointly
measurable and a universal class 0 function.

For general $\alpha > 0$, by transfinite induction (RAP, Section 1.3) suppose there
is a universal class $\beta$ function for all $\beta < \alpha$. First suppose $\alpha$ is a successor,
$\alpha = \beta + 1$ for some $\beta$. Let $H$ be a universal class $\beta$ function on $I \times S$.

Let $I^\infty$ be the countable product of copies of $I$ with the product $\sigma$-algebra.
The product topology on $I^\infty$ is compact and metrizable (RAP, Theorem 2.2.8,
Proposition 2.4.4). Its Borel $\sigma$-algebra is the same as the product $\sigma$-algebra, by
way of the usual base or subbase of the product topology (RAP, Sections 2.1,
2.2). For $t = \{t_n\}_{n \ge 1} \in I^\infty$ let $G(t, x) := \limsup_{n \to \infty} H(t_n, x)$ if the
lim sup is finite, otherwise $G(t, x) := 0$. Then $G$ is jointly measurable. Now
$I^\infty$, as a Polish space, is Borel-isomorphic to $I$ (RAP, Section 13.1), so we can
replace $I^\infty$ by $I$. Then $G$ is a universal class $\alpha$ function.

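If $\alpha$ is not a successor, then there is a sequence $\beta_k \uparrow \alpha$, $\beta_k < \alpha$, meaning
that for every $\beta < \alpha$, there is some $k$ with $\beta < \beta_k$. To see this, as
$\{\beta : \beta < \alpha\}$ is countable, we can write it as $\{\gamma_j\}_{j = 1, 2, \ldots}$, let
$\beta_1 := \gamma_1$, and for each $k \ge 1$, define $\beta_{k+1}$ as $\gamma_i$ for the least $i$ such that
$\gamma_i > \beta_k$.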

For each $k$, let $G_k$ be a universal class $\beta_k$ function. Define $G$ on $I^2 \times S$
by $G(s, t, x) = G_k(t, x)$ if $s = 1/k$ and $G(s, t, x) = 0$ otherwise. Then $G$ is
jointly measurable. Since $I^2$ is Borel-isomorphic to $I$ we again have a universal
class $\alpha$ function, proving Theorem 5.16.


With Theorem 5.16, (i) in Theorem 5.15 clearly implies (ii), so (i) through
(v) are equivalent.

(iv) implies (vi) directly. If (vi) holds, let $Z$ be a subset of $Y$ on which the map
$z \mapsto T(z)(\cdot)$ is one-to-one and onto $\mathcal{F}$. Let $\mathcal{S}_Z$ and $T_Z$ be the restrictions to
$Z$ of $\mathcal{S}$ and $T$ respectively. For a function $f$ and set $B$ let
$f[B] := \{f(x) : x \in B\}$. Then $\mathcal{F}$ remains image admissible via $(Z, \mathcal{S}_Z, T_Z)$, and
$\{T_Z[A] : A \in \mathcal{S}_Z\}$ is an admissible structure for $\mathcal{F}$, giving (iv).


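For $0 \le p < \infty$ and a probability law $Q$ on $(S, \mathcal{B})$ we have the space
$\mathcal{L}^p(S, \mathcal{B}, Q)$ of measurable real-valued functions $f$ on $S$ such that
$\int |f|^p \, dQ < \infty$, with the pseudo-metric

$$d_{p,Q}(f, g) := \begin{cases}
  \left(\int |f - g|^p \, dQ\right)^{1/p}, & 1 \le p < \infty; \\
  \int |f - g|^p \, dQ, & 0 < p < 1; \\
  \inf\{\varepsilon > 0 : Q(|f - g| > \varepsilon) \le \varepsilon\}, & p = 0.
\end{cases}$$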
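
For a law $Q$ with finite support, the three cases of the pseudo-metric can be
computed directly. The sketch below is a hypothetical helper, `d_pQ`, that takes $Q$
as a dictionary from points to probabilities. For $p = 0$ it uses the form
$\inf\{\varepsilon > 0 : Q(|f - g| > \varepsilon) \le \varepsilon\}$; writing $<$ in place of $\le$
gives the same infimum. The search over a finite candidate set relies on the tail
function $\varepsilon \mapsto Q(|f - g| > \varepsilon)$ being a nonincreasing step function.

```python
def d_pQ(f, g, Q, p):
    """Pseudo-metric d_{p,Q}(f, g) for a finitely supported law Q = {point: prob}."""
    diffs = [(abs(f(x) - g(x)), w) for x, w in Q.items()]
    if p >= 1:
        return sum(w * d ** p for d, w in diffs) ** (1.0 / p)
    if p > 0:
        return sum(w * d ** p for d, w in diffs)

    # p == 0: the infimum is attained at a value of |f - g| or at a tail probability.
    def tail(eps):
        return sum(w for d, w in diffs if d > eps)

    candidates = {0.0} | {d for d, _ in diffs} | {tail(d) for d, _ in diffs}
    candidates.add(tail(0.0))
    return min(c for c in candidates if tail(c) <= c)


# Hypothetical example: Q puts mass 1/2 on each of the points 0 and 1.
Q = {0: 0.5, 1: 0.5}
f = lambda x: x
g = lambda x: 0
print(d_pQ(f, g, Q, 1), d_pQ(f, g, Q, 2), d_pQ(f, g, Q, 0.5), d_pQ(f, g, Q, 0))
```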


In admissible classes, $d_{p,Q}$-open sets are measurable, as follows:


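Theorem 5.17 Let $(S, \mathcal{B})$ be a separable measurable space, $0 \le p < \infty$, and
$\mathcal{F} \subset \mathcal{L}^p(S, \mathcal{B}, Q)$ where $\mathcal{F}$ is admissible. Then if $\mathcal{F}$ is image admissible via
$(Y, \mathcal{S}, T)$, $U \subset \mathcal{F}$ and $U$ is relatively $d_{p,Q}$-open in $\mathcal{F}$, we have
$T^{-1}(U) \in \mathcal{S}$.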


Proof. Since $(S, \mathcal{B})$ is separable, the pseudo-metric spaces $\mathcal{L}^p(S, \mathcal{B}, Q)$ are
separable for $0 \le p < \infty$. To prove this, let $\mathcal{C}$ be a countable set of generators
of $\mathcal{B}$, where we can assume $\mathcal{C}$ is an algebra as noted before Theorem 5.14.
Let $\mathcal{G}$ be the set of all simple functions $\sum_{i=1}^n c_i 1_{A_i}$ for $A_i \in \mathcal{C}$, $c_i$
rational, and $n = 1, 2, \ldots$. Then $\mathcal{G}$ is countable and is dense in $\mathcal{L}^p$ for
$0 \le p < \infty$, as is easily checked. So (e.g., RAP, Proposition 2.1.4), $U$ is a
countable union of balls $\{f : d_{p,Q}(f, g) < r\}$, $g \in U$, $0 < r < \infty$. So for $p > 0$ it
is enough to show that each function $y \mapsto \int |T(y)(\cdot) - g|^p \, dQ$ is measurable.
For $g$ fixed, $(y, x) \mapsto |T(y)(x) - g(x)|^p$ is jointly measurable. So for
$0 < p < \infty$ we can reduce to the case $p = 1$ and $g = 0$ with $T(y)(x) \ge 0$ for all
$x, y$. Now, the Tonelli-Fubini theorem implies the desired measurability. For
$p = 0$, given $\varepsilon > 0$, by the Tonelli-Fubini theorem again, the set $A_\varepsilon$ of $y$ such
that $Q(|T(y)(x) - g(x)| > \varepsilon) < \varepsilon$ is measurable, and
$\{y : d_{0,Q}(T(y), g) < r\}$ is the union of $A_\varepsilon$ for $\varepsilon$ rational, $0 < \varepsilon < r$.


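Corollary 5.18 If $(S, \mathcal{B})$ is a separable measurable space, $\mathcal{F} \subset \mathcal{L}^1(S, \mathcal{B}, Q)$
and $\mathcal{F}$ is image admissible via $(Y, \mathcal{S}, T)$ then $y \mapsto \int T(y) \, dQ$ is
$\mathcal{S}$-measurable.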


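Proof. For any real $u$, $\{f : \int f \, dQ > u\}$ is open for $d_{1,Q}$.

For $1 \le p < \infty$ and $f, g \in \mathcal{L}^p(S, \mathcal{B}, Q)$ let
$\rho_{p,Q}(f, g) := d_{p,Q}(f_{0,Q}, g_{0,Q})$ where for $h \in \mathcal{L}^1(S, \mathcal{B}, Q)$,
$h_{0,Q} := h - \int h \, dQ$. Thus for $\rho_Q$ as defined in Section 3.1, $\rho_Q = \rho_{2,Q}$.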


Corollary 5.19 If $(S, \mathcal{B})$ is a separable measurable space, $1 \le p < \infty$,
$\mathcal{F} \subset \mathcal{L}^p(S, \mathcal{B}, Q)$, $\mathcal{F}$ is image admissible via $(Y, \mathcal{S}, T)$, $U \subset \mathcal{F}$ and $U$ is
$\rho_{p,Q}$-open, then $T^{-1}(U) \in \mathcal{S}$.

Proof. Let $\mathcal{G}$ be the set of functions $f - \int f \, dQ$, $f \in \mathcal{F}$. For $y \in Y$ let
$W(y)(x) := T(y)(x) - \int T(y) \, dQ$. Then by Corollary 5.18, $W$ is jointly
measurable and $\mathcal{G}$ is image admissible via $(Y, \mathcal{S}, W)$. Now $f \in U$ if and only
if $f_{0,Q} \in V$ where $V$ is $d_{p,Q}$-open in $\mathcal{G}$. Then $T^{-1}(U) = W^{-1}(V) \in \mathcal{S}$ by
Theorem 5.17.


Next, here is a definition extending the definitions of empirical measure and 
process as given so far. 


Definition. A stochastic process $X_{n,m} = X_{n,m}(f, \omega)$ indexed by a class $\mathcal{F}$ of
measurable functions on a measurable space $(S, \mathcal{B})$, for $\omega$ in $\Omega$, will be called
an empirical-type process if:

(a) For some integers $n \ge 1$ and $m \ge 0$, the probability space $(\Omega, \Pr)$ is
$\Omega = S^{n+m} \times \Omega_0$, where for some probability measures $P$ and $Q$ on $S$ and
$\Pr_0$ on $\Omega_0$, $\Pr = P^n \times Q^m \times \Pr_0$;

(b) The process is of the form, for $\omega_0 \in \Omega_0$,

$$X_{n,m}(f, \omega) = \Bigl(c_0(\omega_0) P + \sum_{i=1}^{n+m} c_i(\omega_0) \delta_{x_i}\Bigr)(f),$$

where $x_i$ are coordinates on $S^{n+m}$, $c_i$ are real-valued random variables on $\Omega_0$,
and for $1 \le i \le n + m$, $\Pr(c_i = 0) < 1$.
If $m = 0$, the process will be written as $X_n$.


Examples of empirical-type processes are: the usual empirical process $\nu_n$,
with $m = 0$, $c_0 = -\sqrt{n}$, and $c_j = 1/\sqrt{n}$, $j = 1, \ldots, n$; and the following
examples, which all have $c_0 = 0$: the usual empirical measure $P_n$, with $m = 0$
and $c_j = 1/n$, $j = 1, \ldots, n$; symmetrized empirical processes, with $m = n$,
$P = Q$, and $c_{n+j} = -c_j$ for $j = 1, \ldots, n$, to be treated in Lemma 6.5; and the
two-sample empirical process $P_n - Q_m$ with $n > 0$ and $m > 0$, or its multiple
by $\sqrt{nm/(n + m)}$, to be treated in Section 9.1.
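
The examples above all fit one template, $c_0 \int f \, dP + \sum_i c_i f(x_i)$, with
different coefficients. The following Python sketch writes that template once, for
fixed (nonrandom) coefficients, and specializes it. The function names are
hypothetical, the value `Pf` of $\int f \, dP$ must be supplied by the caller, and the
two-sample coefficients $1/n$ and $-1/m$ are the natural reading of $P_n - Q_m$ rather
than something stated explicitly in the text.

```python
import math


def empirical_type(f, xs, c, c0=0.0, Pf=0.0):
    """X(f) = c0 * P(f) + sum_i c_i * f(x_i) for fixed coefficients c_i."""
    return c0 * Pf + sum(ci * f(xi) for ci, xi in zip(c, xs))


def empirical_measure(f, xs):
    n = len(xs)
    return empirical_type(f, xs, [1.0 / n] * n)          # c_0 = 0, c_j = 1/n


def empirical_process(f, xs, Pf):
    n = len(xs)
    root = math.sqrt(n)
    return empirical_type(f, xs, [1.0 / root] * n, c0=-root, Pf=Pf)


def two_sample(f, xs, ys):
    n, m = len(xs), len(ys)                              # assumed coefficients
    return empirical_type(f, list(xs) + list(ys), [1.0 / n] * n + [-1.0 / m] * m)


# Hypothetical data: f is the indicator of [0, 1/2], and P = U[0, 1] gives Pf = 1/2.
f = lambda x: 1.0 if x <= 0.5 else 0.0
xs = [0.1, 0.2, 0.4, 0.9]
print(empirical_measure(f, xs), empirical_process(f, xs, 0.5),
      two_sample(f, xs, [0.2, 0.3]))
```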

To avoid some pathologies, empirical-type processes are defined via coordinates
and product measures on Cartesian product spaces, as opposed to having,
for example, $x_1, \ldots, x_n$ i.i.d. $P$, defined on some arbitrary probability space.

Not all processes to be considered later fit into the above definition: for
example, in Poissonized empirical processes, as in Lemma 9.11 and what
follows it, $n$ itself is random. These processes, and bootstrapped empirical
measures $P_n^B$ and processes $\sqrt{n}(P_n^B - P_n)$ in Section 9.2, will be treated as
special cases.


Proposition 5.20 Let $(S, \mathcal{B})$ be a separable measurable space. Then:

(a) For $n = 1, 2, \ldots$, $S^n$ with product $\sigma$-algebra $\mathcal{B}^{\otimes n}$ is also separable.

(b) If $\mathcal{F}$ is an image admissible Suslin class of measurable functions on $S$ via
$(Y, \mathcal{S}, T)$, and $X_{n,m}$ is an empirical-type process indexed by $\mathcal{F}$, with probability
space $S^{n+m} \times \Omega_0$ and $c_0 = 0$, $x = (x_1, \ldots, x_{n+m}) \in S^{n+m}$, and
$\omega_0 \in \Omega_0$, then $(y, x, \omega_0) \mapsto X_{n,m}(T(y))$ is jointly measurable.

(c) If possibly $c_0 \ne 0$ and $\mathcal{F} \subset \mathcal{L}^1(S, \mathcal{B}, P)$, then again
$(y, x, \omega_0) \mapsto X_{n,m}(T(y))$ is jointly measurable.


Proof. Part (a) is straightforward. For (b), $X_{n,m}(T(y))$ is a Borel measurable
function of the $n + m$ jointly measurable functions $(y, x) \mapsto T(y)(x_i)$ and of
$c_i(\omega_0)$. For (c) one can combine part (b) and Corollary 5.18 via another Borel
measurable function.


5.3 Suslin Properties and Selection 


Here is another counterexample on measurability, to add to the two at the
beginning of the chapter. Let $X = [0, 1]$ with Borel $\sigma$-algebra and uniform
(Lebesgue) probability measure $P := U[0, 1]$. Let $A$ be a non-Lebesgue measurable
subset of $[0, 1]$ (e.g., RAP, Theorem 3.4.4). Let $\mathcal{C} := \{\{x\} : x \in A\}$.
Then $\mathcal{C}$ is a collection of disjoint sets, so $S(\mathcal{C}) = 1$ by Theorem 4.10. Also
$\mathcal{C}$, being a class of singletons, is admissible, e.g., by Theorem 5.15(ii) with
$G(t, s) = 1$ for $t = s$, $G(t, s) = 0$ otherwise. But, $\|P_1\|_{\mathcal{C}}$ is nonmeasurable,
being 1 if and only if $X_1 \in A$, and likewise any $\|P_n\|_{\mathcal{C}}$ or $\|P_n - P\|_{\mathcal{C}}$ is
nonmeasurable. So some measurability condition beyond admissibility is needed
for $\|P_n - P\|_{\mathcal{F}}$ to be measurable. A sufficient condition will be provided by
Suslin properties, as follows.

A Polish space is a topological space metrizable as a complete separable 
metric space. A separable measurable space (Y, S) will be called a Suslin space 
iff there is a Polish space X and a Borel measurable map from X onto Y. Recall 
that by the Borel isomorphism theorem (RAP, Theorem 13.1.1), if A is a Borel 
set in a Polish space, then there is a 1-1 Borel measurable function with Borel 
measurable inverse from a Polish space Z onto A, where moreover we can 
take Z to be either [0, 1], or a converging sequence together with its limit, or 
a finite set. Thus in the definition of Suslin space, we can equivalently replace 
the Polish space X by any Borel set in a Polish space. 

If $(Y, \mathcal{S})$ is a measurable space, a subset $Z \subset Y$ will be called a Suslin set
iff it is a Suslin space with the relative $\sigma$-algebra $Z \sqcap \mathcal{S}$.

Given a measurable space $(X, \mathcal{B})$ and $M \subset X$, $M$ is called universally
measurable or u.m. iff for every probability law $P$ on $\mathcal{B}$, $M$ is measurable
for the completion of $P$, in other words for some $A, B \in \mathcal{B}$, $A \subset M \subset B$,
and $P(A) = P(B)$. In a Polish space, all Suslin sets are universally measurable
(RAP, Theorems 13.2.1 and 13.2.6). A function $f$ from $X$ into $Z$, where $(Z, \mathcal{A})$
is a measurable space, will be called universally measurable or u.m. iff for
each set $B \in \mathcal{A}$, $f^{-1}(B)$ is universally measurable.

If $(\Omega, \mathcal{A})$ is a measurable space and $\mathcal{F}$ a set, then a real-valued function
$X: (f, \omega) \mapsto X(f, \omega)$ will be called image admissible Suslin via $(Y, \mathcal{S}, T)$
iff $(Y, \mathcal{S})$ is a Suslin measurable space, $T$ is a function from $Y$ onto $\mathcal{F}$, and
$(y, \omega) \mapsto X(T(y), \omega)$ is jointly measurable on $Y \times \Omega$. Equivalently, $Y$ could
be taken to be Polish with $\mathcal{S}$ its Borel $\sigma$-algebra.

As the notation suggests, a main case of interest will be where $\mathcal{F}$ is a set of
functions on $\Omega$ and $X(f, \omega) = f(\omega)$. Then $\mathcal{F}$ will be called image admissible
Suslin via $(Y, \mathcal{S}, T)$ if $X$ is.

Recall the notion of separable measurable space defined in the last section.
Note that any separable metric space with its Borel $\sigma$-algebra is a separable
measurable space, as follows from RAP, Proposition 2.1.4. We have:


Theorem 5.21 A measurable space $(X, \mathcal{B})$, where $\mathcal{B}$ is countably generated,
is separable if and only if it separates the points of $X$, so that for any $x \ne y$ in
$X$, there is some $A \in \mathcal{B}$ containing just one of $x, y$.


Proof. If $(X, \mathcal{B})$ is separable, then $\{x\} \in \mathcal{B}$ for each $x \in X$, so $\mathcal{B}$ separates
points. The converse direction follows from the proof of Theorem 5.15: (iv)
implies (iii).


Theorem 5.22 (Sainte-Beuve selection theorem) Let $(\Omega, \mathcal{A})$ be any measurable
space and let $X: \mathcal{F} \times \Omega \to \mathbb{R}$ be image admissible Suslin via $(Y, \mathcal{S}, T)$.
Then for any Borel set $B \subset \mathbb{R}$,

$$\Pi_X(B) := \{\omega : X(f, \omega) \in B \text{ for some } f \in \mathcal{F}\}$$

is u.m. in $\Omega$, and there is a u.m. function $H$ from $\Pi_X(B)$ into $Y$ such that
$X(T(H(\omega)), \omega) \in B$ for all $\omega \in \Pi_X(B)$.


Note. Here $(\Omega, \mathcal{A})$ need not be Suslin or even separable.


Proof. As noted above, we can assume $Y$ is Polish and $\mathcal{S}$ is its Borel
$\sigma$-algebra. For any measurable set $V$ in a product $\sigma$-algebra, here
$V = \{(y, \omega) : X(T(y), \omega) \in B\} \subset Y \times \Omega$, there are countably many measurable
sets $A_n \subset Y$ and $B_n \subset \Omega$ such that $V$ is in the $\sigma$-algebra generated by the sets
$A_n \times B_n$ (as in the proof of Theorem 5.15, (iv) implies (iii)). Thus
$(y, \omega) \mapsto X(T(y), \omega)$ is jointly measurable for the given Suslin $\sigma$-algebra $\mathcal{S}$
in $Y$ and a $\sigma$-algebra $\mathcal{B}$ in $\Omega$ generated by a sequence $\{B_n\}$ of measurable sets.

Define a Marczewski function $b$, as in the proof of Theorem 5.13, for the
sequence $\{B_n\}$ in place of the generating sequence $\{C_n\}$ there. Then $b$ is a
$\mathcal{B}$-measurable function from $\Omega$ into the Cantor set $C \subset [0, 1]$ (defined, e.g.,
in RAP, proof of Proposition 3.4.1). Define $\omega =_{\mathcal{B}} \omega'$ iff for all $n$,
$\omega \in B_n$ if and only if $\omega' \in B_n$. Equivalently, for all $V \in \mathcal{B}$, $\omega \in V$ if and
only if $\omega' \in V$. Then $b(\omega') = b(\omega)$ if and only if $\omega =_{\mathcal{B}} \omega'$.

Let $\beta$ be a $\mathcal{B}$-measurable function. Then if $b(\omega) = b(\omega')$, we must have
$\beta(\omega) = \beta(\omega')$. So $\beta(\omega) = g(b(\omega))$ for some function $g$. Now, $\beta$ is
measurable for the $\sigma$-algebra of all sets $b^{-1}(C)$ where $C$ is a Borel subset of
$\mathbb{R}$, since each $B_n$ is such a $b^{-1}(C_n)$ (by the proof of Theorem 5.13). So $g$ can
be taken to be Borel measurable (RAP, Theorem 4.2.8). Similarly,
$X(T(y), \omega) = F(y, b(\omega))$ for some function $F$ on $Y \times C$.

The next step in the proof of Theorem 5.22 is the following fact: 


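Lemma 5.23 Every $\mathcal{S} \otimes \mathcal{B}$ measurable real function $\psi$ on $Y \times \Omega$ can be
written as $F(\cdot, b(\cdot))$ where $F$ is $\mathcal{S} \otimes \mathcal{B}_C$ measurable and $\mathcal{B}_C$ is the Borel
$\sigma$-algebra in $C$.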


Proof. This is easily seen when $\psi$ is the indicator of a set $S \times B$, $S \in \mathcal{S}$,
$B \in \mathcal{B}$, or a finite linear combination of such indicators. A finite union of sets
$S_i \times B_i$, $S_i \in \mathcal{S}$, $B_i \in \mathcal{B}$, can be written as a disjoint union (RAP,
Proposition 3.2.2). Thus the Lemma holds when $\psi$ is the indicator of any such
union. Next, suppose $\psi_n$ are $\mathcal{S} \otimes \mathcal{B}$ measurable, $\psi_n = F_n(\cdot, b(\cdot))$ where $F_n$ are
$\mathcal{S} \otimes \mathcal{B}_C$ measurable, and $\psi_n \to \psi$ pointwise on $Y \times \Omega$. For any $y \in Y$ and
$\omega, \omega' \in \Omega$ such that $b(\omega) = b(\omega')$, we have $\psi_n(y, \omega) = \psi_n(y, \omega')$ for all
$n$, so $\psi(y, \omega) = \psi(y, \omega')$. Let $F(y, c) := \lim_{n \to \infty} F_n(y, c)$ whenever the
sequence converges, as it will if $c = b(\omega)$ for some $\omega \in \Omega$. Otherwise, set
$F(y, c) := 0$. Then $F$ is $\mathcal{S} \otimes \mathcal{B}_C$ measurable (RAP, proof of Theorem 4.2.5)
and $\psi(y, \omega) = F(y, b(\omega))$ for all $y \in Y$ and $\omega \in \Omega$. Since the finite unions
of sets $S_i \times B_i$ form an algebra, the monotone class theorem (RAP, Theorem
4.4.2) proves the Lemma when $\psi$ is the indicator function of any set in the
product $\sigma$-algebra $\mathcal{S} \otimes \mathcal{B}$. It thus holds when $\psi$ is any simple function (finite
linear combination of such indicator functions). Since each measurable function is
a pointwise limit of simple functions, the Lemma follows.


So, $X(T(y), \omega) = F(y, b(\omega))$ for an $\mathcal{S} \otimes \mathcal{B}_C$ measurable function $F$ on
$Y \times C$.

Now, $\{(y, c) \in Y \times C : F(y, c) \in B\}$ is a Borel subset of $Y \times C$ (RAP,
Proposition 4.1.7) and thus a Suslin set (RAP, Theorem 13.2.1). It follows that
$\Pi_F B := \{c \in C : \text{for some } y \in Y,\ F(y, c) \in B\}$ is a Suslin subset of $C$, and
so universally measurable (RAP, Theorem 13.2.6).

By a selection theorem (RAP, Theorem 13.2.7) there is a u.m. function $h$
from $\Pi_F B$ into $Y$ with $F(h(c), c) \in B$ for all $c \in \Pi_F B$. Next we need:


Lemma 5.24 Let $(D, \mathcal{D})$ and $(G, \mathcal{G})$ be two measurable spaces and $g$ a
measurable function from $D$ into $G$. Let $U \subset G$ be u.m. for $\mathcal{G}$. Then $g^{-1}(U)$ is
u.m. for $\mathcal{D}$.

Proof. Let $P$ be a law on $(D, \mathcal{D})$, so $P \circ g^{-1}$ is a law on $(G, \mathcal{G})$. Take
$A, B \in \mathcal{G}$ with $A \subset U \subset B$ and $(P \circ g^{-1})(B \setminus A) = 0$. Then $g^{-1}(A)$ and
$g^{-1}(B)$ are in $\mathcal{D}$, $g^{-1}(A) \subset g^{-1}(U) \subset g^{-1}(B)$ and
$P(g^{-1}(B) \setminus g^{-1}(A)) = P(g^{-1}(B \setminus A)) = 0$.


Remark. The Lemma includes the case where $D \subset G$, $g$ is the identity, and $\mathcal{D}$
includes the relative $\sigma$-algebra $\{J \cap D : J \in \mathcal{G}\}$, while possibly $D \notin \mathcal{G}$ and
$D$ may not be u.m. for $\mathcal{G}$.


Now to finish the proof of Theorem 5.22, Lemma 5.24 implies that

$$\Pi_X(B) = b^{-1}(\{c \in C : \text{for some } y \in Y,\ F(y, c) \in B\})$$

is u.m. Let $H := h \circ b$ from $\Omega$ into $Y$. Then for any $E \in \mathcal{S}$,
$H^{-1}(E) = b^{-1}(h^{-1}(E))$. Here $h^{-1}(E)$ is a u.m. set in $\Pi_F(B)$, so
$b^{-1}(h^{-1}(E))$ is u.m. in $\Omega$ by Lemma 5.24, and $H$ is a u.m. function. By choice
of $h$ we have $X(T(H(\omega)), \omega) \in B$ for all $\omega \in \Pi_X B$.


Some possibilities for the set B ⊂ R are the sets {x : x > t} or {x : |x| > t}
for any real t. These choices give:


Corollary 5.25 Let (f, ω) ↦ X(f, ω) be real-valued and image admissible
Suslin via some (Y, S, T). Then the two functions ω ↦ sup{X(f, ω) : f ∈ ℱ}
and ω ↦ sup{|X(f, ω)| : f ∈ ℱ} are both u.m.


The image-admissible Suslin property is preserved by composing with a mea- 
surable function: 


Theorem 5.26 Let (Ω, 𝒜) be a measurable space and ℱ_i for i = 1, ..., k
classes of measurable real-valued functions on Ω. Let X^1, ..., X^k be
image admissible Suslin real-valued functions on ℱ_i × Ω, i = 1, ..., k, via
(Y_i, S_i, T_i), i = 1, ..., k, respectively. Let g be a Borel measurable
function from R^k into R. Then

(f_1, ..., f_k, ω) ↦ g(X^1(f_1, ω), ..., X^k(f_k, ω))

is image admissible Suslin via some (Y, S, T). Specifically, we can let
Y = Y_1 × ··· × Y_k with product σ-algebra S = S_1 ⊗ ··· ⊗ S_k and let
T(y_1, ..., y_k) := (T_1(y_1), ..., T_k(y_k)).


Proof. Clearly (Y, S) is Suslin and the joint measurability holds. 


Next, here are examples showing that if the Suslin assumption on Y is
removed from Theorem 5.22, it may fail. Let (Ω, <) and the class 𝒞 of countable
initial segments I_x be as in the Example near the beginning of this chapter. We
can take Ω to be (in 1-1 correspondence with) a subset of [0, 1], by the axiom
of choice (or, by the continuum hypothesis, all of [0, 1]). Here the ordering <
has no relation to any usual structure on [0, 1]. Then 𝒞 is admissible by
Theorem 5.15, since the collection of all countable sets is of bounded Borel
class: all finite unions of open intervals (q, r) with rational endpoints are
in some F_α, then all finite sets are in F_{α+1} and all countable sets in
F_{(α+1)+1} =: F_{α+2}.

Let P be a law on Ω which is 0 on countable sets and 1 on sets with countable
complement. Such a law, on the σ-algebra generated by singletons, exists on
any uncountable set. Under the continuum hypothesis with Ω = [0, 1] we can
take P to be Lebesgue measure or any nonatomic law.

Let P_1 = δ_x, P_2 = (δ_x + δ_y)/2 and Q_1 = δ_z, where x, y, z are coordinates
on Ω^3 with law P^3, so x, y, and z are i.i.d. (P). Then
sup_{A∈𝒞} (P_1 − Q_1)(A) = 1 if and only if x < z. Thus
sup_{A∈𝒞} (P_1 − Q_1)(A) is nonmeasurable.

If we let B := {(x, y, z) : sup_{A∈𝒞} |(P_2 − Q_1)(A)| = 1}, it can be seen
likewise that B must not be measurable. So “Suslin” cannot simply be removed
from Corollary 5.25 or Theorem 5.22.

“Two-sample” empirical processes, which are multiples of P_m − Q_n, are
of direct interest in statistics, as in testing the hypothesis that P_m and
Q_n are sampled from the same distribution P = Q. If they are, then

$$\sqrt{\frac{mn}{m+n}}\,(P_m - Q_n)$$

converges in law to G_P in ℓ^∞(ℱ) if ℱ is a Donsker class for P, as will be
seen in Section 9.1. On the other hand, facts can be proved about the
one-sample empirical process √n (P_n − P) via symmetrization, subtracting an
independent copy of itself to get √n (P_n − Q_n), as will be done in Section
6.1 and applied in Section 6.2 and thereafter.
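
As a small numerical illustration of the normalized two-sample process, the following sketch computes the supremum of the scaled difference of two empirical measures over half-lines (−∞, t]. The choice of half-lines, the uniform samples, the shift by 0.5, and the function names are assumptions made only for this example; they are not taken from the text.

```python
import math
import random


def ecdf(sample, t):
    """Empirical distribution function: fraction of sample points <= t."""
    return sum(1 for v in sample if v <= t) / len(sample)


def two_sample_statistic(xs, ys):
    """sqrt(mn/(m+n)) * sup_t |P_m((-inf, t]) - Q_n((-inf, t])|.

    The difference of the two empirical distribution functions is a
    right-continuous step function that only changes at sample points,
    so it is enough to scan the pooled sample.
    """
    m, n = len(xs), len(ys)
    sup_diff = max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in xs + ys)
    return math.sqrt(m * n / (m + n)) * sup_diff


rng = random.Random(0)  # fixed seed, purely for reproducibility
same_x = [rng.random() for _ in range(300)]
same_y = [rng.random() for _ in range(200)]
shifted_y = [rng.random() + 0.5 for _ in range(200)]  # a different law

stat_same = two_sample_statistic(same_x, same_y)
stat_shifted = two_sample_statistic(same_x, shifted_y)
print(stat_same, stat_shifted)
```

When both samples come from the same law the statistic stays of order 1, while under the shifted alternative it grows like the square root of the sample sizes.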

An admissible structure can be put on spaces of closed sets. Let (X, d) be a
separable metric space and ℱ_0 the collection of all nonempty closed subsets of
X. Then ℱ_0 is admissible: there is a countable base for the topology of X, so
for some α, all finite unions of sets in the base are in F_α, so all open sets
are in F_{α+1}. Then all closed sets are in F_{α+2} since any closed set F is a
countable intersection of open sets

U_n := {x : d(x, F) := inf_{y∈F} d(x, y) < 1/n}.

The topology of X can be metrized by a metric d for which (X, d) is totally
bounded (RAP, Theorem 2.8.2). Assume d is such a metric. Let h_d be the
Hausdorff metric, h_d(A, B) := max{sup_{x∈A} d(x, B), sup_{y∈B} d(y, A)} for any
two closed sets A, B.
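
For finite subsets of the real line the Hausdorff metric can be computed directly from its definition. The sketch below assumes d(x, y) = |x − y| and finite sets given as Python lists; both are choices made only for illustration.

```python
def dist_to_set(x, points):
    """d(x, B) := min over y in B of |x - y|, for a finite set B of reals."""
    return min(abs(x - y) for y in points)


def hausdorff(a, b):
    """h_d(A, B) = max(sup_{x in A} d(x, B), sup_{y in B} d(y, A))."""
    return max(
        max(dist_to_set(x, b) for x in a),
        max(dist_to_set(y, a) for y in b),
    )


A = [0.0, 0.5, 1.0]
B = [0.0, 1.0]
print(hausdorff(A, B))  # 0.5: the point 0.5 of A is at distance 0.5 from B
```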

Since d is totally bounded, it is easily seen that (ℱ_0, h_d) is separable,
since the finite subsets of a countable dense set in X are dense in ℱ_0 for h_d.

The Borel σ-algebra of h_d will be called an Effros Borel structure on
ℱ_0. (Effros (1965) proved that for d totally bounded this Borel structure is
unique.)


Proposition 5.27 For any separable metric space X with totally bounded
metric d, the Effros Borel structure (of h_d) is admissible on ℱ_0. Also, for
any law P on the Borel sets of X, P(·) is measurable for the Effros Borel
structure.


Proof. The set {(x, F) : x ∈ F ∈ ℱ_0} is closed in X × ℱ_0: if x_n ∈ F_n for
all n and (x_n, F_n) → (x, F), then x_n → x in X and h_d(F_n, F) → 0. Then
d(x_n, F) → 0 and so x ∈ F. Since both (X, d) and (ℱ_0, h_d) are separable, a
Borel set in their product is in the product σ-algebra (RAP, Proposition 4.1.7).

For any law P and c > 0, the set {F ∈ ℱ_0 : P(F) ≥ c} is closed for h_d,
since if F_n → F for h_d, P(F_n) ≥ c, and ε > 0, let F^ε := {y : d(x, y) < ε for
some x ∈ F}. Then F_n ⊂ F^ε for n large enough, so P(F^ε) ≥ c and letting
ε = 1/m ↓ 0 shows P(F) ≥ c. So P(·) is upper semicontinuous and so Effros
measurable.


For families of functions, we have the following: 


Theorem 5.28 (a) Let S be a topological space and ℱ a family of bounded real
functions on S, equicontinuous at each point of S. Then (f, x) ↦ f(x) is jointly
continuous from ℱ × S to R, with the supremum norm ||f||_∞ := sup_x |f(x)|
on ℱ.

(b) If in addition ℱ is separable for ||·||_∞ and S is metrizable as a separable
metric space (S, d), then (f, x) ↦ f(x) is jointly measurable for the Borel
σ-algebras of ||·||_∞ on ℱ and d on S. Thus ℱ is admissible. So is 𝒢, the
collection of all subgraphs {(x, y) : 0 ≤ y ≤ f(x) or f(x) ≤ y ≤ 0} for f ∈ ℱ.

(c) If, moreover, ℱ with ||·||_∞ distance and its Borel σ-algebra is a Suslin
set, then ℱ and 𝒢 are image admissible Suslin.


Proof. Part (a) is immediate. For ℱ, (b) follows from the fact that on a
Cartesian product of two separable metric spaces, the Borel σ-algebra of
the product topology equals the product σ-algebra of the Borel σ-algebras
on each space (RAP, Theorem 4.1.7). For 𝒢, the set {(y, z) : 0 ≤ y ≤ z or
z ≤ y ≤ 0} is closed, thus Borel in R^2. Composing its indicator function
with (f, x) ↦ z := f(x) thus gives a measurable function. Then (c) follows
directly.


Example (Adamski and Gaenssler). Let H ⊂ [0, 1] be a nonmeasurable set
with λ_*(H) = 0 and λ^*(H) = 1 where λ is Lebesgue measure (e.g., RAP,
Theorem 3.4.4). Then for each Borel set B ⊂ [0, 1], letting μ(B ∩ H) :=
λ^*(B ∩ H) defines a countably additive probability measure on the Borel
subsets of H, as a metric space with the usual metric from R (RAP, Theorem
3.3.6). Likewise, λ^* gives a probability measure ν on the Borel sets
of H^c := [0, 1] \ H. Take the countable product of probability spaces
(Ω, ρ) := ∏_{n=1}^∞ (A_n, B_n, ρ_n) where for n odd, A_n = H and ρ_n = μ while
for n even, A_n = H^c and ρ_n = ν. Such a product of probability spaces always
exists (e.g., RAP, Theorem 8.2.2). Let X_n be the nth coordinate on Ω, viewed
as a map from Ω into [0, 1]. Then each X_n is measurable and has law U[0, 1],
the uniform distribution on [0, 1]. Thus, the X_j are i.i.d. U[0, 1]. Let 𝒞 be
the collection of all finite subsets of H. Let P_n be the empirical measures
defined by the given X_j. Then ||P_n||_𝒞 = 1/2 for n even and ||P_n||_𝒞 =
(n + 1)/(2n) for n odd. On the other hand if X_j are coordinates on a countable
product of copies of ([0, 1], λ), then ||P_n||_𝒞 is nonmeasurable. This
illustrates that 𝒞 is a pathological class of sets, but the pathology can be
obscured if one does not define X_j as coordinates on a product space with
product probability.


Problems 


1. A law Q on R^n is called exchangeable if it is invariant under all
permutations of the coordinates.


(a) Give an example of an exchangeable law which is not of the form P^n for a
law P.


(b) Show that the empirical measure P_n is sufficient for any set of
exchangeable laws.


2. (continuation) Give an example of a class of non-exchangeable laws on R^2
for which the empirical measure is still sufficient. Hint: Consider laws with
X_2 = 2X_1.


3. An exponential family is a set {P_θ}_{θ∈Θ} of laws on a measurable space
(X, 𝒜), all dominated by a σ-finite measure μ, having densities of the form
dP_θ/dμ = e^{(f(θ), g(x))} where Θ is a set, f maps Θ into R^k, g maps X into
R^k, and (·, ·) is the usual inner product in R^k. If 𝒫 is such an exponential
family and 𝒫^n := {P^n : P ∈ 𝒫} is the corresponding set of laws on X^n, show
that Σ_{j=1}^n g(X_j) is a sufficient statistic for 𝒫^n.


4. (a) Show that the class 𝒞 of initial segments I_x in the example at the
beginning of Chapter 5 is admissible. Hint: The σ-algebra generated by the sets
I_x is not separable, but the set Ω has the smallest possible uncountable
cardinality. Thus we can assume Ω is in 1-1 correspondence with a subset of
[0, 1] (cf. RAP, Appendix A.3). Use this to show that there exists a separable
structure on Ω for which the sets I_x are all measurable. Then, see the
discussion after Theorem 5.26.


(b) A measure space (X, S, μ) where S contains all singletons {x} is called
nonatomic if μ({x}) = 0 for all x ∈ X. Let S be a σ-algebra of subsets of
𝒞 from part (a) making it admissible, for some σ-algebra B of subsets of Ω
containing all singletons. Show that there is no nonatomic probability measure
on S.


5. (a) Let (X, ℰ) be a measurable space such that ℰ contains all singletons
{x} for x ∈ X and X is uncountable. Suppose that for any nonatomic law
P on (X, ℰ), P(A) = 0 or 1 for all A ∈ ℰ, and that there exists at least one
nonatomic law P on (X, ℰ). Show that ℰ cannot be countably generated. Hint:
Suppose A_k are generators. Let J be the set of all k such that Q(A_k) = 0 for
every nonatomic law Q on (X, ℰ). Show that we can replace X by the subset
Y := X \ ∪_{k∈J} A_k with the relative σ-algebra ℰ_Y := {A ∩ Y : A ∈ ℰ} and the
assumptions still hold, where {B_j}_{j≥1} equal to {A_k ∩ Y : k ∉ J} are a
nonempty collection and generate ℰ_Y. Thus we can assume that J = ∅, i.e., for
all k, there is a nonatomic law Q_k on ℰ with Q_k(A_k) = 1. Then find a
nonatomic law Q on ℰ with Q(A_k) = 1 for all k and consider ∩_k A_k.


(b) In the Remarks before Theorem 5.13, prove that the σ-algebra discussed
there is not countably generated. Hint: Use part (a) and the following: for
any two sets X and I, let X^I be the product space consisting of all functions
from I into X. Suppose (X, 𝒜, P) is a probability space. Let 𝒜^I be the product
σ-algebra, the smallest σ-algebra of subsets of X^I for which the coordinate
projection f ↦ f(i) of X^I onto X is measurable for each i ∈ I. Let P^I be the
product probability measure on 𝒜^I for which the coordinates are i.i.d. P,
which exists (RAP, Theorem 8.2.2). A 1-1 function π from I onto itself is
called a finite permutation if π(i) = i except for at most finitely many values
of i. Such a π defines a 1-1 measurable function T_π of X^I onto itself by
T_π({x_i}_{i∈I}) = {x_{π(i)}}_{i∈I}. A set B ∈ 𝒜^I is called invariant if it is
taken onto itself by T_π for every finite permutation π. Apply the
Hewitt–Savage 0-1 law (RAP, Theorem 8.4.6), which states that for any invariant
set B, P^I(B) = 0 or 1.


6. Let (S, d) be a separable metric space. Let Y_1, Y_2, ..., be Suslin subsets
of S, for the Borel σ-algebra on each. Show that ∪_k Y_k and ∩_k Y_k are also
Suslin. Hints: Let f_k be Borel functions from Polish spaces S_k onto Y_k. For
∪_k Y_k use a disjoint union of copies of S_k, and show it is Polish for a
suitable topology or metric. For ∩_k Y_k, use that the Cartesian product
∏_k S_k with product topology is Polish (RAP, Proposition 2.4.4 gives a metric,
which is complete if the metric d_k on S_k is for each k). Find a Borel set in
the product which is mapped onto ∩_k Y_k by a Borel function, which is
sufficient as noted soon after the definition of the Suslin property.


7. Let (X, A) be a measurable space and V a finite-dimensional vector space 
of measurable real-valued functions on X. 


(a) Show that V is image admissible Suslin. Hint: See Theorem 5.26.
(b) Show that {1_A : A ∈ pos(V)} is also image admissible Suslin.


8. Let (X, 𝒜) be a measurable space and let 𝒞 ⊂ 𝒜 be image admissible Suslin
via some (Y, S, T), i.e. {1_B : B ∈ 𝒞} is. Let 𝒞^{(k)} be the union of all
Boolean algebras generated by k or fewer sets in 𝒞, as in Theorem 4.8. Show that
𝒞^{(k)} is also image admissible Suslin. Hints: On any set with the σ-algebra of
all its subsets, specifically a finite set, all functions are measurable.
Consider the following steps:


(a) Given k, show that the set of all functions x ↦ (1_{C_1}(x), ..., 1_{C_k}(x)) ∈
{0, 1}^k for any C_1, ..., C_k ∈ 𝒞 is image admissible Suslin, via Y^k, the
Cartesian product of k copies of Y with product σ-algebra, and using
(1_{T(y_1)}, ..., 1_{T(y_k)}).


(b) Define a map from {0, 1}^k into {0, 1}^{2^k} taking any (1_{C_1}, ..., 1_{C_k})
into a vector whose components give indicator functions of all atoms of the
Boolean algebra generated by C_1, ..., C_k and possibly of the empty set.


(c) For any m = 1, 2, ..., define a function U from {0, 1}^m = {b = {b_j}_{j=1}^m :
b_j = 0 or 1 for all j} into {0, 1}^{2^m} such that for every subset
E ⊂ {1, ..., m} there is an r = 1, ..., 2^m such that U(b)_r = max_{j∈E} b_j.

Combine steps (a), (b), and (c) and use as the Y space for 𝒞^{(k)} the product
space Y^k × F_k, where F_k is the finite set {1, 2, ..., 2^{2^k}}, to get the
solution.


9. Let (X, 𝒜) be a measurable space such that {x} ∈ 𝒜 for all x ∈ X. If P is a
law on 𝒜, call a set A ∈ 𝒜 a soft atom if P(A) > 0, for all B ⊂ A with B ∈ 𝒜
either P(B) = P(A) or P(B) = 0, and P({x}) = 0 for all x ∈ A. (A “hard
atom” would be an ordinary atom, namely, a singleton {x} with P({x}) > 0.)


(a) If X is a separable metric space with Borel σ-algebra 𝒜, show that no law
on 𝒜 has a soft atom. Hint: Show that a soft atom has a sequence of soft atom
subsets, decreasing to a singleton.


(b) Give an example of a soft atom. Hint: Let X be uncountable and 𝒜 consist
of countable sets and their complements.


Notes 


Notes to Section 5.1. Fisher (1922) invented the idea of sufficiency and Neyman 
(1935) gave a first form of factorization theorem. Halmos and Savage (1949) 
proved the factorization theorem, Theorem 5.1, in case h ∈ L^1(S, B, μ). Vivid
ad hoc concepts such as “kernel” and “chain” in the proof of Lemma 5.2 
are typical of Halmos’ style of writing proofs. Bahadur (1954) removed the 
restriction on h, by the proof given above in the proof of Theorem 5.1. 

Theorems 5.7 and 5.8 (sufficiency of P_n) follow from facts given by Neveu
(1977, pp. 267-268). I thank Sam Gutmann for showing me the proof actually 
given in the section and Don Cohn for telling me about Neveu’s proof. 

I do not have references for Theorem 5.9, Lemma 5.10, or Corollary 5.12. 
Yet in some sense, it is well known that the empirical distribution function 
incorporates all the information in an i.i.d. sample (Corollary 5.12).


Notes to Section 5.2. Theorem 5.16, due to Lebesgue, is proved in Natanson
(1957, p. 137). Theorem 5.15 is due to Aumann (1961), and the proof here to
B. V. Rao (1971), except that the “image admissible” part was added in Dudley
(1984). Freedman (1966), Lemma (5), proved that the σ-algebra mentioned
before Theorem 5.13 is not countably generated.


Notes to Section 5.3. There are several papers on selection theorems. Sainte- 
Beuve (1974, Theorem 3) gives a selection theorem close to Theorem 5.22, 
which covers Suslin topological spaces. Sainte-Beuve did not use the terminol- 
ogy “(image) admissible Suslin.” 

M. Suslin published in 1917 an example of a Borel set in R^2 whose projection
into R is not Borel, disproving a statement in a 1905 paper by Lebesgue (on 
the history, see the notes to Section 13.2 of RAP). Such projections, or more 
generally direct images of Borel sets in Polish spaces by Borel functions, are 
sometimes called analytic sets. They are here called Suslin sets in honor of 
Suslin’s discovery of them, and because analytic sets are generally defined 
to be subsets of Polish spaces, e.g. in Cohn (1980) and in RAP. Here, Suslin 
measurable spaces are defined, not necessarily with any metric or even topology. 

Darst (1971) showed that even an infinitely differentiable (C^∞) function can
take a Borel set onto a non-Borel set in R.

Originally (e.g., Dudley 1984, Section 10.3), the definition of image
admissible Suslin required that (Ω, 𝒜) also be a Suslin space. I thank Uwe
Einmahl, who pointed out to me around 1988 that the Suslin assumption on Ω was
unnecessary. I am grateful to the late Lucien Le Cam for stimulating
discussions about this section.

Durst and Dudley (1981) gave the example after Theorem 5.26. Effros 
(1965) is the original paper on the Effros Borel structure. Strobl (1994, 1995) 
and Ziegler (1994, 1997a,b) have given special attention to measurability issues 
for empirical processes. W. Adamski and P. Gaenssler told me in 1992 about
the example at the end of the section.


Limit Theorems for Vapnik–Červonenkis and Related Classes


6.1 Koltchinskii–Pollard Entropy and Glivenko–Cantelli Theorems


There are some good sufficient conditions for limit theorems
(Glivenko–Cantelli or Donsker theorems) over Vapnik–Červonenkis and certain
related classes, using the following form of “entropy” or “capacity.”

First, let (X, 𝒜) be a measurable space and ℱ ⊂ ℒ^0(X, 𝒜), the space of all
real-valued measurable functions on X. Recall that F_ℱ(x) := sup{|f(x)| : f ∈
ℱ} (Section 4.8). Then F_ℱ(x) = ||δ_x||_ℱ. A measurable function F ∈ ℒ^0(X, 𝒜)
with F ≥ F_ℱ is called an envelope function for ℱ. If F_ℱ is 𝒜-measurable it
is called the envelope function of ℱ. If a law P is given on (X, 𝒜), then F_ℱ^*
for P will be called the envelope function of ℱ for P, defined up to equality
P-a.s.

Let Γ be the set of all laws on X of the form n^{-1} Σ_{j=1}^n δ_{x(j)} for some
x(j) ∈ X, j = 1, ..., n, and n = 1, 2, ..., where the x(j) need not be distinct.
For δ > 0, 0 < p < ∞, and γ ∈ Γ recall (Section 4.8) that if F is an envelope
function of ℱ,

$$D_F^{(p)}(\delta, \mathcal{F}, \gamma) := \sup\Big\{ m :\ \text{for some } f_1, \dots, f_m \in \mathcal{F} \text{ and all } i \ne j,\ \int |f_i - f_j|^p \, d\gamma > \delta^p \int F^p \, d\gamma \Big\}. \tag{6.1}$$

Let D_F^{(p)}(δ, ℱ) := sup_{γ∈Γ} D_F^{(p)}(δ, ℱ, γ). Here D_F^{(p)}(δ, ℱ, γ) is a
kind of packing number, involving the envelope function F. The corresponding
“capacity” will be the logarithm of D_F^{(p)}. Such logarithms will appear in
Section 6.3.

Let 𝒢 := {f/F : f ∈ ℱ}, where 0/0 is replaced by 0. Then |g(x)| ≤ 1 for
all g ∈ 𝒢 and x ∈ X. Given F, p, and γ ∈ Γ let Q(B) := ∫_B F^p dγ / γ(F^p),
if γ(F^p) > 0. Then Q := Q_γ is a law and for 1 ≤ p < ∞, D_F^{(p)}(δ, ℱ, γ) =
D(δ, 𝒢, d_{p,Q}), where d_{p,Q}(f, g) := (∫ |f − g|^p dQ)^{1/p} (as defined
before Theorem 5.17).
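
For a finite class and a law γ with finitely many atoms, the packing number in (6.1) can be found by exhaustive search. The following sketch is only an illustration under assumptions that are not in the text: the class is a short Python list of functions, γ is the uniform law on a list of points, and the toy class consists of indicators of half-lines.

```python
from itertools import combinations


def packing_number(funcs, envelope, points, delta, p=1.0):
    """Brute-force D_F^{(p)}(delta, funcs, gamma) for a FINITE class `funcs`,
    where gamma is the uniform law on the list `points` (repeats allowed).

    Returns the largest m such that some f_1, ..., f_m in `funcs` satisfy
    integral |f_i - f_j|^p dgamma > delta^p * integral F^p dgamma for i != j.
    """
    n = len(points)
    threshold = delta ** p * sum(envelope(x) ** p for x in points) / n

    def separated(f, g):
        dist_p = sum(abs(f(x) - g(x)) ** p for x in points) / n
        return dist_p > threshold

    # Exhaustive search over subsets, largest first (fine for small classes).
    for m in range(len(funcs), 1, -1):
        for subset in combinations(funcs, m):
            if all(separated(f, g) for f, g in combinations(subset, 2)):
                return m
    return 1  # a single function is always a valid packing


# Hypothetical toy class: indicators of half-lines (-inf, t] seen on 10 points.
points = list(range(1, 11))                       # gamma = uniform on {1,...,10}
funcs = [lambda x, t=t: 1.0 if x <= t else 0.0 for t in range(0, 11)]
envelope = lambda x: 1.0                          # F = 1 for indicator classes

print(packing_number(funcs, envelope, points, delta=0.25))  # 4
```

Here two indicators are δ-separated exactly when their thresholds differ by at least 3, so at most four of them (for example t = 0, 3, 6, 9) can be pairwise separated.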

For example, if 𝒞 is a collection of measurable sets, whose union is all of X,
and ℱ := {1_A : A ∈ 𝒞}, the envelope function is F ≡ 1 and D_F^{(p)}(δ, ℱ, γ) =
D(δ, ℱ, d_{p,γ}). The next few results will connect D_F^{(p)}(δ, ℱ) with other
ways of measuring the size of certain classes ℱ. First, we have
Vapnik–Červonenkis classes 𝒞 of sets with dens(𝒞) ≤ S(𝒞) < +∞ (Corollary 4.4):


Theorem 6.1 If 𝒞 ⊂ 𝒜, dens(𝒞) < +∞, 1 ≤ p < ∞, F ∈ L^p(X, 𝒜, P), F ≥ 0, and
ℱ := {F 1_A : A ∈ 𝒞}, then for any w > dens(𝒞) there is a K < ∞ such that

D_F^{(p)}(δ, ℱ) ≤ K δ^{−pw},   0 < δ ≤ 1.


Proof. Let γ ∈ Γ and let G be the smallest set with γ(G) = 1. We may assume
F(x) > 0 for some x ∈ G since otherwise for any K ≥ 1,

D_F^{(p)}(δ, ℱ, γ) = 1 ≤ K δ^{−pw} for 0 < δ ≤ 1.

If C(1), ..., C(m) ∈ 𝒞 are such that
[d_{p,γ}(F 1_{C(i)}, F 1_{C(j)})]^p > δ^p ∫ F^p dγ for i ≠ j, with m maximal,
then for Q = Q_γ,

$$Q(C(i) \,\Delta\, C(j)) = \int_{C(i) \Delta C(j)} F^p \, d\gamma \Big/ \int F^p \, d\gamma > \delta^p.$$

Then by maximality and Theorem 4.47, there is a K(w, 𝒞) < ∞ such that

D_F^{(p)}(δ, ℱ, γ) ≤ m ≤ D(δ^p, 𝒞, d_Q) ≤ K(w, 𝒞) δ^{−pw}.


Next, here is a kind of converse to Theorem 6.1: 


Proposition 6.2 Suppose 1 ≤ p < ∞, ℱ = {1_B : B ∈ 𝒞} for some collection
𝒞 of sets, F ≡ 1, and for some δ with 0 < δ^p < 1/2, D_F^{(p)}(δ, ℱ) < ∞. Then
dens(𝒞) ≤ S(𝒞) < ∞.


Proof. First, dens(𝒞) ≤ S(𝒞) by Corollary 4.4. Next, suppose 𝒞 shatters a set
G with card G = n = 2^m for some positive integer m. Let γ have mass 1/n
at each point of G. Then G has m subsets A(i), i = 1, ..., m, independent
for γ, with γ(A(i)) = 1/2, i = 1, ..., m (taking elements of G as strings of
m binary digits, let A(i) be the set where the ith digit is 1). Then for i ≠ k,
∫ |1_{A(i)} − 1_{A(k)}|^p dγ = 1/2. So for each j in (6.1) we can take
f_j = 1_{A(j)}. Thus

D_F^{(p)}(δ, ℱ) ≥ D_F^{(p)}(δ, ℱ, γ) ≥ m.

For large m this is impossible, so S(𝒞) < ∞.


Next, consider families of functions of the form ℱ = {Fg : g ∈ 𝒢} where
𝒢 is a family of functions totally bounded in the supremum norm

||g||_sup := sup_{x∈X} |g(x)|,

with the associated metric d_sup(g, h) := ||g − h||_sup and with ||g||_sup ≤ 1
for all g ∈ 𝒢.


Proposition 6.3 If 1 ≤ p < ∞ and 0 < ε ≤ 1, then for any such ℱ,

D_F^{(p)}(ε, ℱ) ≤ D(ε, 𝒢, d_sup).


Proof. If d_sup(g, h) ≤ δ, then for all x, |Fg − Fh|^p(x) ≤ δ^p F(x)^p, so the
result follows from the definition (6.1).


Now we come to the Koltchinskii–Pollard method of symmetrization of
empirical measures. Given n = 1, 2, ..., let x_1, ..., x_{2n} be coordinates on
(X^{2n}, 𝒜^{2n}, P^{2n}), hence i.i.d. P. Let Ω_n := {0, 1}^n with the uniform
probability distribution U_n, giving probability 1/2^n to each point. Thus the
coordinates e_i, i = 1, ..., n, on Ω_n are i.i.d. and equal 0 or 1 with
probability 1/2 each. Let σ(i) := 2i − e_i and τ(i) := 2i − 1 + e_i for each
i = 1, ..., n. Take the product space (X^{2n}, P^{2n}) ⊗ (Ω_n, U_n), so that
{x_j}_{j=1}^{2n} is independent of {e_j, σ(j), τ(j)}_{j=1}^n. Then x(σ(j)) for
j = 1, ..., n are i.i.d. P. Let

$$P_n' := \frac{1}{n} \sum_{j=1}^n \delta_{x(\sigma(j))}, \qquad P_n'' := \frac{1}{n} \sum_{j=1}^n \delta_{x(\tau(j))},$$
$$\nu_n' := n^{1/2}(P_n' - P), \qquad \nu_n'' := n^{1/2}(P_n'' - P),$$
$$P_n^0 := P_n' - P_n'', \qquad \nu_n^0 := n^{1/2} P_n^0.$$

The variables ε_j = 2e_j − 1 are i.i.d. Rademacher variables, having values
±1 with probability 1/2 each. Let E_n be their joint distribution. We have the
following alternate representation:


Proposition 6.4 In X^n × X^n × {−1, 1}^n,

({x_{σ(j)}}_{j=1}^n, {x_{τ(j)}}_{j=1}^n, {ε_j}_{j=1}^n)

has the same distribution as

({x_j}_{j=1}^n, {x_{n+j}}_{j=1}^n, {ε_j}_{j=1}^n),

namely, P^n × P^n × E_n.


Proof. For any given values of ε_j = 2e_j − 1 or of e_j, {x_{σ(j)}}_{j=1}^n and
{x_{τ(j)}}_{j=1}^n are independent, each with distribution P^n. The conclusion
follows.


It follows that ν_n′ and ν_n″ are two independent copies of ν_n = √n (P_n − P).
Here is a symmetrization fact. The Lemma will be used to upper-bound the
probability on the right side by (1 − ζ^2 η^{−2})^{−1} times the probability on
the left, which is easier to bound by symmetry. It may be compared to the
inequalities of P. Lévy, Theorem 1.20, a reflection principle, and that of
Ottaviani, Theorem 1.19. The proofs of both those inequalities use a stopping
time, the first time that partial sums of independent real random variables
reach or cross a given level. Here the class ℱ has no linear ordering and we
cannot choose a “least” f. Instead, measurable selection, from Theorem 5.22, is
used in place of a stopping time.


Lemma 6.5 Let ζ > 0 and ℱ ⊂ L^2(X, 𝒜, P) with ∫ f^2 dP ≤ ζ^2 for all f ∈ ℱ.
Assume ℱ is image admissible Suslin via some (Y, S, T). Then for any η > ζ

Pr{||ν_n^0||_ℱ > η} ≥ (1 − ζ^2 η^{−2}) Pr{||ν_n||_ℱ > 2η}.


Proof. The given events are measurable (up to sets of probability 0) by
Proposition 5.20(c) for ν_n or (b) for ν_n^0, and Corollary 5.25. For
x = (x_1, ..., x_{2n}) let H := {(x, ω, y) ∈ X^{2n} × Ω_n × Y : |ν_n″(T(y))| > 2η}.
Using the second of the alternate representations in Proposition 6.4, we can
take P_n″ = (1/n) Σ_{j=1}^n δ_{x_{n+j}}. Let x′ = {x_{n+j}}_{j=1}^n and
ξ = {x_j}_{j=1}^n, so that ξ and x′ are i.i.d. P^n and ν_n″ depends only on x′,
not ξ or ω ∈ Ω_n. Let H_1 be the set of (x′, y) such that (ξ, x′, ω, y) ∈ H for
some or equivalently all ξ ∈ X^n and ω ∈ Ω_n. Then by image admissibility, H_1
is a product measurable subset of X^n × Y. The Suslin property implies
(Theorem 5.22) that there is a universally measurable selector h such that
whenever (x′, y) ∈ H_1 for some y ∈ Y, h(x′) ∈ Y and (x′, h(x′)) ∈ H_1.
Here h(x′) is defined if and only if x′ ∈ J for some u.m. set J ⊂ X^n by
Theorem 5.22. On the set where x′ ∈ J, since ν_n^0 = ν_n′ − ν_n″,

Pr(||ν_n^0||_ℱ > η | x′) ≥ Pr(|ν_n′(T(h(x′)))| ≤ η | x′).

Given x′, T(h(x′)) is a fixed function f ∈ ℱ with ∫ f^2 dP ≤ ζ^2. Next, since ξ
is independent of x′, we can apply Chebyshev's inequality to obtain

Pr(|ν_n′(f)| ≤ η) ≥ 1 − ζ^2 η^{−2}.

Integrating gives Pr{||ν_n^0||_ℱ > η} ≥ (1 − (ζ/η)^2) Pr{||ν_n″||_ℱ > 2η},
which, since ν_n″ is a copy of ν_n, gives the result.
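
The Chebyshev step in the proof can be spelled out as follows. This is only a sketch of the variance computation that the proof leaves implicit; it assumes the second representation of Proposition 6.4, in which ν_n′(f) = n^{−1/2} Σ_{j=1}^n (f(ξ_j) − Pf) with ξ_1, ..., ξ_n i.i.d. with law P.

```latex
\[
\begin{aligned}
E\,\nu_n'(f) &= 0, \\
\operatorname{Var}\big(\nu_n'(f)\big)
  &= \frac{1}{n}\sum_{j=1}^n \operatorname{Var}\big(f(\xi_j)\big)
   = \int f^2\,dP - \Big(\int f\,dP\Big)^2 \le \zeta^2, \\
\Pr\big(|\nu_n'(f)| > \eta\big)
  &\le \frac{\operatorname{Var}(\nu_n'(f))}{\eta^2} \le \frac{\zeta^2}{\eta^2},
\qquad\text{so}\qquad
\Pr\big(|\nu_n'(f)| \le \eta\big) \ge 1 - \zeta^2\eta^{-2}.
\end{aligned}
\]
```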


Some reversed martingale and submartingale properties of the empirical
measures P_n will be proved. Recall that Q(f) := ∫ f dQ for any f ∈ L^1(Q),
and that in defining empirical measures P_n := (1/n) Σ_{j=1}^n δ_{X_j}, the X_j
are always (in this book) taken as coordinates on a product of copies of a
probability space (X, 𝒜, P), so that the underlying probability measure for P_n
is P^n, a product of n copies of P.

Here are some definitions for reversed (sub)martingales. Let (Ω, ℬ, Q) be
a probability space. For n ≥ 1 let ℬ_n be a sub-σ-algebra of ℬ such that
ℬ_n ⊃ ℬ_{n+1} for all n (the reverse of the usual inclusion for martingales).
For each n let X_n ∈ ℒ^1(Ω, ℬ_n, Q). Then {X_n} is called a reversed
submartingale if E(X_n | ℬ_{n+1}) ≥ X_{n+1} almost surely for all n, or a
reversed martingale if the same holds with = in place of ≥.


Remarks. In some presentations (including RAP), the index integers are taken
to be negative, so that we have ℬ_n ⊂ ℬ_{n+1} for n = −2, −3, .... Here, however,
positive integers have been preferred, as in P_n we have n positive.

For each n, let 𝒟_n be the smallest σ-algebra for which all X_m for m ≥ n
are measurable. Then from the definitions, ℬ_n ⊃ 𝒟_n for all n. In some
presentations, ℬ_n is taken to equal 𝒟_n. Very often (e.g., in RAP), a
conditional expectation E(X|𝒜) is required to be an 𝒜-measurable function.
Here, however, it may equal such a function almost surely. We need to allow
completion measurability (adjoining sets of probability 0) to apply some facts
in Chapter 5. As conditional expectations are only defined up to almost sure
equality, completion does not make a great difference.


Theorem 6.6 Let (X, 𝒜, P) be a probability space with (X, 𝒜) separable,
ℱ ⊂ L^1(P), and P_n empirical measures for P. Let S_n be the smallest σ-algebra
for which P_k(f) are measurable for all k ≥ n and all f ∈ L^1(X, 𝒜, P). Then:


(a) For any f ∈ ℱ, {P_n(f), S_n}_{n≥1} is a reversed martingale; in other words
E(P_{n−1}(f) | S_n) = P_n(f) a.s., if n ≥ 2.


(b) (F. Strobl) Suppose ℱ has an envelope function F ∈ L^1(X, 𝒜, P) and
that for each n, ||P_n − P||_ℱ is measurable for the completion of P^n.
Then (||P_n − P||_ℱ, S_n)_{n≥1} is a reversed submartingale; in other words
we have ||P_{n+1} − P||_ℱ ≤ E(||P_n − P||_ℱ | S_{n+1}) a.s. for all n.


Remark. ||P_n − P||_ℱ will be completion measurable if ℱ is image admissible
Suslin, by Proposition 5.20(c) and Corollary 5.25.


Proof. For each n, any set in S_n is invariant under permutations of the first
n coordinates X_j, where P_n := (1/n)(δ_{X_1} + ··· + δ_{X_n}). So if
1 ≤ i < j ≤ n, E(f(X_i) | S_n) = E(f(X_j) | S_n). Summing over i = 1, ..., k and
dividing by k gives E(P_k(f) | S_n) = E(f(X_1) | S_n) for k = 1, ..., n. Letting
k = n − 1 and n proves (a).

Proof of (b): clearly ||P||_ℱ ≤ PF < ∞ and ||P_n||_ℱ ≤ P_n F. Since by
assumption ||P_n − P||_ℱ is completion measurable, we have E||P_n − P||_ℱ ≤
2PF < ∞.

Next, it will be shown that ||P_n − P||_ℱ is measurable for the completion of
S_n. It is enough to show that for each rational q > 0, the set A_q := A_{q,n}
where ||P_n − P||_ℱ > q is measurable for the completion of S_n. Let

X^∞ := {{x_j}_{j≥1} : x_j ∈ X} = X^n × X^{(n)}

where X^{(n)} = {{x_j}_{j>n} : x_j ∈ X}. The set A_q itself is invariant under all
transformations in the set Π_n of permutations of the first n coordinates. Here
Π_n may be viewed as acting either on X^∞ or on X^n. Let 𝒜^{[n]} :=
{B × X^{(n)} : B ∈ 𝒜^n} and P^n(B × X^{(n)}) := P^n(B). By assumption there are
sets C_q, D_q in 𝒜^{[n]} with C_q ⊂ A_q ⊂ D_q and P^n(D_q \ C_q) = 0. Each image
of C_q by an element of Π_n is also in 𝒜^{[n]} and included in A_q. The union U_q
of these n! images, being in 𝒜^{[n]} and symmetric under all n! permutations of
the indices 1, ..., n, is S_n-measurable by Theorem 5.8, and included in A_q.
Likewise, the intersection F_q of all images of D_q by transformations in Π_n is
S_n-measurable and includes A_q. Clearly P^n(F_q \ U_q) = 0, so A_q and
||P_n − P||_ℱ are measurable for the completion of S_n, as desired.

For i = 1, ..., n + 1 let P_{n,i} := (1/n) Σ {δ_{X_j} : j = 1, ..., n + 1, j ≠ i}.
Then since the X_j are coordinates on a product space, P_{n,i} has the same
properties as P_n for each i, and P_{n,n+1} = P_n. Thus ||P_{n,i} − P||_ℱ is
completion measurable with respect to 𝒜^{n+1}.

Now, the conditional expectations E(||P_{n,i} − P||_ℱ | S_{n+1}) are all the
same (equal to each other almost surely) for i = 1, ..., n + 1, so all are a.s.
equal to E(||P_n − P||_ℱ | S_{n+1}), because for each i, a transformation
defined by a permutation of the first n + 1 coordinates takes P_{n,i} into P_n
while leaving all sets in the σ-algebra S_{n+1} invariant. Then for any
n = 1, 2, ..., we have

$$P_{n+1} - P = \frac{1}{n+1} \sum_{i=1}^{n+1} (P_{n,i} - P), \qquad \|P_{n+1} - P\|_{\mathcal{F}} \le \frac{1}{n+1} \sum_{i=1}^{n+1} \|P_{n,i} - P\|_{\mathcal{F}}, \quad\text{and}$$

$$\|P_{n+1} - P\|_{\mathcal{F}} = E\big(\|P_{n+1} - P\|_{\mathcal{F}} \,\big|\, S_{n+1}\big) \le \frac{1}{n+1} \sum_{i=1}^{n+1} E\big(\|P_{n,i} - P\|_{\mathcal{F}} \,\big|\, S_{n+1}\big) = E\big(\|P_n - P\|_{\mathcal{F}} \,\big|\, S_{n+1}\big)$$

a.s., finishing the proof.
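
The averaging identity used at the end of the proof, namely that the empirical measure of n + 1 points is the average of the n + 1 leave-one-out empirical measures, is easy to check numerically. In this sketch the Gaussian sample, the test function f, and the centering constant c are arbitrary choices made for illustration only.

```python
import random

rng = random.Random(2)
n = 6
sample = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]  # X_1, ..., X_{n+1}
f = lambda v: max(v, 0.0)  # an arbitrary integrable test function


def empirical_mean(f, points):
    return sum(f(v) for v in points) / len(points)


# Leave-one-out means: empirical mean over the n points left after deleting X_i.
leave_one_out = [
    empirical_mean(f, sample[:i] + sample[i + 1:]) for i in range(n + 1)
]
full = empirical_mean(f, sample)               # empirical mean of all n + 1 points
average = sum(leave_one_out) / (n + 1)         # average of the leave-one-out means
print(abs(full - average) < 1e-12)  # True
```

The inequality for the supremum norms then follows from the triangle inequality applied to this average.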


Recall D_F^{(p)}(δ, ℱ) as defined after (6.1). Here is a law of large numbers
(generalized Glivenko–Cantelli theorem):


Theorem 6.7 Let (X, 𝒜, P) be a probability space such that (X, 𝒜) is
separable, F ∈ L^1(X, 𝒜, P), and ℱ a collection of measurable functions on X
having F as an envelope function. Suppose ℱ is image admissible Suslin via
(Y, S, T). Assume that

D_F^{(1)}(δ, ℱ) < ∞ for all δ > 0.   (6.2)

Then lim_{n→∞} ||P_n − P||_ℱ = 0 a.s.


Proof. By Theorem 6.6, {||P_n − P||_ℱ, S_n}_{n≥1} is a reversed submartingale.
Being nonnegative, it converges almost surely and in L^1 (RAP, Theorem 10.6.4).
It will now be enough to show ||P_n − P||_ℱ → 0 in probability.

Given ε > 0, take M > 1 large enough so that P(F 1_{F>M}) < ε/4. Then

$$\|(P_n - P) 1_{F>M}\|_{\mathcal{F}} \le \|P_n 1_{F>M}\|_{\mathcal{F}} + P(F 1_{F>M}) \le (P_n + P)(F 1_{F>M}) \to 2P(F 1_{F>M}) < \varepsilon/2$$

almost surely as n → ∞. Replacing each f ∈ ℱ by f 1_{F≤M} takes ℱ onto
another class of functions which is still image admissible Suslin since
(x, y) ↦ T(y)(x) 1_{F(x)≤M} is jointly measurable while the Suslin space (Y, S)
is unchanged. So we may assume F ≤ M.

Next apply Lemma 6.5 with ζ = M and η = n^{1/2} ε/4 > 2M for n large.
Then it is enough to show that ||P_n^0||_ℱ = ||P_n′ − P_n″||_ℱ → 0 in
probability. For any f ∈ ℱ, we can write

$$P_n^0(f) = \frac{1}{n} \sum_{j=1}^n \varepsilon_j \big[f(x_{2j}) - f(x_{2j-1})\big] \tag{6.3}$$


where ε_j are Rademacher variables independent of x = {x_j}_{j=1}^{2n} as in
Proposition 6.4. Let ω := {ε_j}_{j=1}^n ∈ E := {−1, 1}^n with distribution E_n.
The function

$$(x, \omega, y) \mapsto P_n^0(T(y)) = \frac{1}{n} \sum_{j=1}^n \varepsilon_j \big[\delta_{x(2j)} - \delta_{x(2j-1)}\big](T(y))$$

is jointly measurable by image admissibility (Proposition 5.20). Then by
Corollary 5.25, ||P_n^0||_ℱ is jointly completion-measurable in (x, ω).

For n ≥ n_0 large enough, δ P_{2n}(F) < ε/4 on a set X′ of x with
P^{2n}(X′) > 1 − ε/4, where δ := ε/(9M) and K(δ) := D_F^{(1)}(δ, ℱ). Let
A := A_n be the event that ||P_n^0||_ℱ > ε/2. Then by the Tonelli–Fubini theorem

$$\Pr(A_n) \le \frac{\varepsilon}{4} + \int_{X'} \int_E 1_{A_n}(x, \omega) \, dE_n(\omega) \, dP^{2n}(x). \tag{6.4}$$


For each fixed x ∈ X′, an upper bound will be given for the inner integral.
There exist m = m(x) ≤ K(δ) functions f_1, ..., f_m ∈ ℱ such that for any
f ∈ ℱ, for some j ≤ m,

P_{2n}(|f − f_j|) ≤ δ P_{2n}(F) < ε/4,

by the definitions of K(δ) and X′. Then

$$|P_n^0(f) - P_n^0(f_j)| = |P_n'(f - f_j) - P_n''(f - f_j)| \le (P_n' + P_n'')(|f - f_j|) = 2P_{2n}(|f - f_j|) \le 2\delta P_{2n}(F) < \varepsilon/4.$$


From the Rademacher distribution we have f P?( f)dE,(@) = 0 for all x. We 
also have | f(x2;) — f(%2;-1)| < 2M for each f € F and each j. It follows that 
for any x € X’, the variance of P°( f) with respect to En (w) is at most 4M°/n. 
Thus, denoting probability with respect to E,,(@) for fixed x as Pr, x, we have 
that 


Prax [max [PCF] 2/4} 
Jam 
K(8) sup {Prr x {|P2(f)| > £/4} : IFI < M} < 4K()M*(4/e)/n < e/4 


for n large using Chebyshev’s inequality. So, integrating with respect to the 
distribution of x € X’, we have for n large enough the unconditional probability 


E E 
Pr {|| PP |, > £/2} < aa as 


proving Theorem 6.7. 
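The conditional mean and variance used in the Chebyshev step can be checked by brute force for a small sample. The following is only an illustrative sketch: the sample `xs`, the bound `M`, and the test function are hypothetical choices, and the enumeration over all $2^n$ sign vectors is feasible only for small $n$.

```python
from itertools import product

def symmetrized_mean(f, xs, signs):
    """P_n^0(f) = (1/n) * sum_j e_j * (f(x_{2j}) - f(x_{2j-1})), as in (6.3).

    xs holds x_1, ..., x_{2n} (0-based list), signs holds e_1, ..., e_n.
    """
    n = len(signs)
    return sum(e * (f(xs[2 * j + 1]) - f(xs[2 * j])) for j, e in enumerate(signs)) / n

def conditional_moments(f, xs):
    """Exact mean and variance of P_n^0(f) over all 2^n sign vectors, x held fixed."""
    n = len(xs) // 2
    values = [symmetrized_mean(f, xs, s) for s in product((-1, 1), repeat=n)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Hypothetical data: 2n = 8 points in [0, 1] and a function bounded by M = 1.
xs = [0.1, 0.9, 0.4, 0.35, 0.8, 0.2, 0.6, 0.65]
M = 1.0
mean, var = conditional_moments(lambda t: min(t * t, M), xs)
print(mean, var, 4 * M ** 2 / (len(xs) // 2))  # mean is 0 up to rounding; var <= 4M^2/n
```

The variance equals $n^{-2}\sum_j (f(x_{2j}) - f(x_{2j-1}))^2$, which is where the bound $4M^2/n$ comes from.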


Vapnik and Cervonenkis first proved the following (under a different mea- 
surability assumption). 


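Corollary 6.8 (Vapnik and Červonenkis) Let $\mathcal C \subset \mathcal A$ where $(X, \mathcal A)$ is a separable
measurable space, $S(\mathcal C) < \infty$, and $\mathcal C$ is image admissible Suslin. Then
for any probability law $P$ on $\mathcal A$, we have

$$\lim_{n\to\infty}\sup_{A\in\mathcal C}|(P_n - P)(A)| = 0 \quad\text{a.s.}$$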


Proof. In Theorem 6.7, take $F \equiv 1$ and apply Theorem 6.1 with $p = 1$.
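As a concrete illustration of the uniform law of large numbers, take the class of half-lines $(-\infty, t]$ on the real line, a standard example of a VC class, with $P$ uniform on $[0, 1]$. The sketch below computes $\sup_t |(P_n - P)((-\infty, t])|$ exactly from a sample; the function name and the simulated sample sizes are illustrative choices, not part of the text.

```python
import random

def sup_deviation_uniform(sample):
    """sup over half-lines (-inf, t] of |P_n - P| when P is uniform on [0, 1].

    The supremum is attained at, or just to the left of, a sample point.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)  # fixed seed so the run is reproducible
for n in (100, 1000, 10000):
    print(n, sup_deviation_uniform([rng.random() for _ in range(n)]))
```

The printed deviations should shrink roughly like $n^{-1/2}$, in line with the almost sure convergence to 0.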


Corollary 6.9 Let $(X, \mathcal A, P)$ be a probability space, and $\mathcal F$ a VC subgraph
class of measurable functions on $X$ with an envelope function $F \in L^1(X, \mathcal A, P)$.
Suppose $\mathcal F$ is image admissible Suslin via $(Y, \mathcal S, T)$.

Then $\lim_{n\to\infty}\|P_n - P\|_{\mathcal F} = 0$ a.s.


Proof. By Theorem 4.52 with p = 1, the hypothesis (6.2) of Theorem 6.7 
holds. As the other hypotheses are assumed, the conclusion holds. 
6.2 Glivenko–Cantelli Properties for Given P


The laws of large numbers (Glivenko—Cantelli theorems) uniformly over a class 
C of sets, for a given P, to be proved in this section, are essentially due to Vapnik 
and Cervonenkis (without assuming that C is a VC class, a case already treated 
in Corollary 6.8) and to J. M. Steele; only the measurability assumptions here 
differ from theirs. 

Let $(X, \mathcal A, P)$ be a probability space and $\mathcal C \subset \mathcal A$. Let $\{x_n\}_{n\ge 1}$ be coordinates
in $(X^\infty, \mathcal A^\infty, P^\infty)$ so that $x_j$ are i.i.d. $P$. For certain classes $\mathcal C$ with $S(\mathcal C) = +\infty$
one will have $\Delta^{\mathcal C}(\{x_1, \dots, x_n\}) < 2^n$ with $P^n$-probability converging to 1. For
such classes, with sufficient measurability properties, a law of large numbers
will still hold. Here a main result is as follows:


Theorem 6.10 If $(X, \mathcal A, P)$ is any probability space with $(X, \mathcal A)$ separable,
$\mathcal C \subset \mathcal A$, and $\mathcal C$ is image admissible Suslin, then the following are equivalent:


(a) $\|P_n - P\|_{\mathcal C} \to 0$ a.s. as $n \to \infty$;
(b) $\|P_n - P\|_{\mathcal C} \to 0$ in probability as $n \to \infty$;
(c) $\lim_{n\to\infty} n^{-1}E\log\Delta^{\mathcal C}(\{x_1, \dots, x_n\}) = 0$.


For a finite set $F \subset X$ and collection $\mathcal C \subset 2^X$, let $k^{\mathcal C}(F) := S(\mathcal C \sqcap F)$.
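For a finite class given explicitly, both $\Delta^{\mathcal C}(F)$ (the number of distinct traces $A \cap F$) and $k^{\mathcal C}(F)$ (read here as the largest cardinality of a subset of $F$ shattered by $\mathcal C$) can be computed by brute force. The sketch below does this; the helper names and the toy class of integer intervals are hypothetical and only serve as an example.

```python
from itertools import combinations

def traces(sets, points):
    """Distinct intersections A ∩ F for A in the class; their number is Delta^C(F)."""
    pts = frozenset(points)
    return {frozenset(a) & pts for a in sets}

def shattered_size(sets, points):
    """k^C(F): size of the largest subset of F shattered by the class.

    Subsets of shattered sets are shattered, so the search can stop at the
    first size m for which no m-element subset is shattered.
    """
    pts = list(dict.fromkeys(points))  # drop repeated points, keep order
    best = 0
    for m in range(1, len(pts) + 1):
        if any(len(traces(sets, g)) == 2 ** m for g in combinations(pts, m)):
            best = m
        else:
            break
    return best

# Toy class: integer intervals {a, ..., b-1} inside {0, ..., 5}, plus the empty set.
intervals = [set(range(a, b)) for a in range(7) for b in range(a, 7)]
F = [1, 3, 4]
print(len(traces(intervals, F)), shattered_size(intervals, F))
```

For three points, intervals pick out every subset except the one consisting of the two outer points, so the trace count is 7 rather than $2^3$ and the largest shattered subset has 2 elements.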


Lemma 6.11 Under the hypotheses of Theorem 6.10, both $\Delta^{\mathcal C}(\{x_1, \dots, x_n\})$ and
$k^{\mathcal C}(\{x_1, \dots, x_n\})$ are universally measurable.


Proof. Let $\mathcal C$ be image admissible Suslin via $(Y, \mathcal S, T)$. For any decomposition
$D$ of $\{1, \dots, n\}$ into subsets, let $X^n_D$ be the set of all ordered $n$-tuples
$(x_1, \dots, x_n)$ such that $x_i = x_j$ if and only if $i$ and $j$ belong to the same subset in
$D$. Then $X^n_D$ is $\mathcal A^n$ measurable as follows: by Theorem 5.13, up to measurable
isomorphism, we can take $X \subset [0, 1]$ with $\mathcal A$ as the Borel $\sigma$-algebra of $X$ as a
metric space (or equivalently, the intersections with $X$ of Borel sets in $[0, 1]$).
In that case, each set $\{x : x_i = x_j\}$ for $i \ne j$ is clearly measurable. It is enough
to prove the Lemma on each $X^n_D$; specifically, where $D$ is the decomposition
into singletons $\{j\}$, the set on which the $x_j$ are all different.

For each set $J \subset \{1, \dots, n\}$, and $x := (x_1, \dots, x_n)$, $U(J) := \{(x, y) : x_j \in
T(y)$ if and only if $j \in J\}$ is product measurable by image admissibility. Thus
its projection $\Pi U(J)$ into $X^n$ is universally measurable by Theorem 5.22. Now

$$\Delta^{\mathcal C}(\{x_1, \dots, x_n\}) = \sum_{J\subset\{1,\dots,n\}} 1_{\Pi U(J)}(x) \quad\text{and}\quad \{k^{\mathcal C}(\{x_1, \dots, x_n\}) \ge m\} = \bigcup_{G\subset\{1,\dots,n\},\ |G|=m}\ \bigcap_{H\subset G}\Pi U(H). \tag{6.5}$$


The main step in Steele's proof uses Kingman's subadditive ergodic theorem
(the equivalent superadditive ergodic theorem is RAP, Theorem 10.7.1). To state
it, here is some terminology. Let $\mathbb N$ denote the set of nonnegative integers. A
subadditive process is a doubly indexed array $\{x_{mn}\}_{0\le m<n<\infty}$ of real random
variables, $m, n \in \mathbb N$, $m < n$, such that

$$x_{kn} \le x_{km} + x_{mn} \quad\text{whenever } k < m < n. \tag{6.6}$$

Let $x_{nn} := 0$ for all $n \in \mathbb N$. If instead of (6.6),

$$x_{kn} \ge x_{km} + x_{mn}, \quad k < m < n, \tag{6.7}$$

then $\{x_{mn}\}_{0\le m<n}$ is called superadditive.

A process which is both subadditive and superadditive is called additive and
can clearly be written as $x_{mn} = \sum_{j=m+1}^n x_j$, where $x_j := x_{j-1,j}$; i.e., one has
just partial sums of a sequence of random variables.

A subadditive process $\{x_{mn}\}_{0\le m<n}$ defined on a probability space $(\Omega, \mathcal B, \Pr)$
will be called stationary if there is a measure-preserving transformation $V$ of
$\Omega$ onto itself such that for any integers $0 \le m < n$, $x_{mn}(V(\omega)) = x_{m+1,n+1}(\omega)$.
Recall that $(f \circ g)(x) := f(g(x))$. Let $V^k := V \circ (V \circ (\cdots \circ V)\cdots)$ to $k$
terms. Then for $k = 1, 2, \dots$, $x_{mn} \circ V^k = x_{m+k,n+k}$. Let $\mathcal S$ be the $\sigma$-algebra
of all $B \in \mathcal B$ such that $V^{-1}(B) = B$.

Another useful hypothesis for subadditive processes is:

$$\text{For each } n \in \mathbb N,\ E|x_{0n}| < +\infty, \text{ and } \kappa := \inf_n E x_{0n}/n > -\infty. \tag{6.8}$$

A $\sigma$-algebra $\mathcal D \subset \mathcal B$ will be called degenerate if $\Pr(D) = 0$ or 1 for all $D \in \mathcal D$.
Here is Kingman's subadditive ergodic theorem.


Theorem 6.12 (Kingman) Let $\{x_{mn}\}_{0\le m<n}$ be a stationary subadditive process
satisfying (6.8). Then as $n \to \infty$, $x_{0n}/n$ converges a.s. and in $L^1$ to a random
variable $y := \inf_{n\ge 1} n^{-1}E(x_{0n}|\mathcal S)$ with $Ey = \kappa$. If $\mathcal S$ is degenerate, $y = \kappa$
a.s.


Proof. RAP, Theorem 10.7.1 applies with $f_n$ there defined as $-x_{0n}$. From near
the end of the proof we have $Ey \le Ex_{0n}/n$ for all $n$ and $Ex_{0n}/n \to Ey$ as
$n \to \infty$, so $Ey = \kappa$.


To apply Theorem 6.12 in proving Theorem 6.10 we have:


Theorem 6.13 Let $(\Omega, \mathcal T, \Pr)$ be a probability space, $(X, \mathcal A)$ a measurable
space, $X_1$ a measurable function from $\Omega$ into $X$, and $V$ a measure-preserving
transformation of $\Omega$ onto itself. Let $X_j := X_1 \circ V^{j-1}$ for $j = 2, 3, \dots$. Let
$\mathcal C$ be image admissible Suslin and $\emptyset \ne \mathcal C \subset \mathcal A$. Then each of the following is a
stationary subadditive process satisfying (6.8):


(a) $D^{\mathcal C}_{mn} := \sup_{A\in\mathcal C}\big|\sum_{m<i\le n}(1_A(X_i) - P(A))\big|$;
(b) $\log\Delta^{\mathcal C}_{mn} := \log\Delta^{\mathcal C}(\{X_{m+1}, \dots, X_n\})$;
(c) $k^{\mathcal C}_{mn} := k^{\mathcal C}(\{X_{m+1}, \dots, X_n\})$.


Proof. We have measurability by Corollary 5.25 in (a) and Lemma 6.11 in (b)
and (c). Stationarity clearly holds for the same $V$ in each case. Subadditivity is
clear in (a) and not difficult for (b) and (c). All three processes are nonnegative:
in (b), $\mathcal C$ nonempty implies $\Delta^{\mathcal C} \ge 1$, so (6.8) holds.
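Subadditivity in case (b) reflects the fact that a trace on a union of two point sets is determined by the pair of traces on the two pieces, so the trace count on the union is at most the product of the two counts. The following sketch checks this numerically for the class of closed intervals on the line; the class, the sample, and the function names are illustrative assumptions.

```python
import math
import random

def interval_trace_count(points):
    """Delta^C(F) for C = closed intervals [a, b], computed by brute force.

    A trace is either empty or a run of consecutive points in sorted order.
    """
    pts = sorted(set(points))
    found = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            found.add(frozenset(pts[i:j + 1]))
    return len(found)

def log_delta(xs, m, n):
    """log Delta^C_{mn} = log Delta^C({x_{m+1}, ..., x_n}), with xs[0] playing x_1."""
    return math.log(interval_trace_count(xs[m:n]))

rng = random.Random(1)
xs = [rng.random() for _ in range(12)]
ok = all(
    log_delta(xs, k, n) <= log_delta(xs, k, m) + log_delta(xs, m, n) + 1e-12
    for k in range(12) for m in range(k + 1, 12) for n in range(m + 1, 13)
)
print(ok)  # every triple k < m < n satisfies (6.6)
```

The small tolerance only guards against floating-point rounding in cases of equality, such as two single points.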


Now Theorem 6.10 will be proved. The quantities $\|P_n - P\|_{\mathcal C}$ are completion
measurable by Corollary 5.25. The probability space, as usual, is a countable
Cartesian product of copies of $(X, \mathcal A, P)$, with coordinates $x_1, x_2, \dots$.
The measure-preserving transformation will be $V(\{x_j\}_{j\ge 1}) := \{x_{j+1}\}_{j\ge 1}$. By
the Kolmogorov 0-1 law (RAP, Theorem 8.4.5), $\mathcal S$ is degenerate. Then by
Theorems 6.12 and 6.13, we have almost sure limits

$$c_1 := \lim_{n\to\infty} D^{\mathcal C}_{0n}/n, \quad c_2 := \lim_{n\to\infty}(\log\Delta^{\mathcal C}_{0n})/n, \quad c_3 := \lim_{n\to\infty} k^{\mathcal C}_{0n}/n \tag{6.9}$$

for some constants $c_i$. Thus in Theorem 6.10, (a) and (b) are equivalent
(as also shown in the proof of Theorem 6.7 via the reversed submartingale
property). It remains to prove (b) equivalent to (c).
To show (c) implies (b), given $0 < \varepsilon < 1$, first, apply the symmetrization
Lemma 6.5 with $\zeta = 1$ and $\eta = \varepsilon n^{1/2}/2$ for $n$ large enough so that $\eta > 2$,
giving

$$\Pr\{\|P_n - P\|_{\mathcal C} > \varepsilon\} \le 2\Pr\{\|P_n' - P_n''\|_{\mathcal C} > \varepsilon/2\}.$$

Thus it suffices to show $\|P_n' - P_n''\|_{\mathcal C} \to 0$ in probability (these variables are
universally measurable as usual by Proposition 5.20 and Corollary 5.25). For
any fixed set $A \in \mathcal A$, the conditional probability

$$\Pr\nolimits_{A,\varepsilon,n} := \Pr\big\{|P_n^0(A)| > \varepsilon \,\big|\, \{x_j\}_{j=1}^{2n}\big\} = \Pr\Big\{\Big|\sum_{j=1}^n e_j(\delta_{x_{2j}} - \delta_{x_{2j-1}})(A)\Big| > n\varepsilon\Big\}$$

where $e_j := 2\cdot 1_{\sigma(j)=2j} - 1$ are Rademacher variables independent of the $x_j$ as
in (6.3). Thus by Hoeffding's inequality, Proposition 1.12, with $a_j = -1$, 0 or
1, $\Pr_{A,\varepsilon,n} \le 2\exp(-n\varepsilon^2/2)$.
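The size of this bound can be compared with the exact tail of a Rademacher sum. The sketch below treats the extreme case in which every coefficient $a_j$ is $\pm 1$ and uses the standard form of Hoeffding's inequality, $\Pr(|\sum_j a_j e_j| \ge t) \le 2\exp(-t^2/(2\sum_j a_j^2))$; the function names are illustrative.

```python
import math

def rademacher_tail(n, t):
    """Exact Pr(|e_1 + ... + e_n| > t) for independent Rademacher signs e_j.

    The sum equals 2*h - n, where h ~ Binomial(n, 1/2) counts the +1 signs.
    """
    hits = sum(math.comb(n, h) for h in range(n + 1) if abs(2 * h - n) > t)
    return hits / 2 ** n

def hoeffding_bound(n, eps):
    """The bound 2*exp(-n*eps^2/2) for Pr(|sum| > n*eps) when all a_j = +-1."""
    return 2 * math.exp(-n * eps ** 2 / 2)

for n in (10, 50, 200):
    for eps in (0.1, 0.25, 0.5):
        print(n, eps, rademacher_tail(n, n * eps), hoeffding_bound(n, eps))
```

When some $a_j$ vanish the sum has fewer terms and the same bound still holds, since $\sum_j a_j^2 \le n$.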


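For some $n_0$, $E\log\Delta^{\mathcal C}_{0,2n} < n\varepsilon^4$ for $n \ge n_0$. Then by Markov's inequality,

$$\Pr(\log\Delta^{\mathcal C}_{0,2n} > n\varepsilon^3) < \varepsilon.$$

Given $\{x_j\}_{j=1}^{2n}$, the event $|P_n^0(A)| > \varepsilon$ is the same for any two sets $A$ having
the same intersection with $\{x_1, \dots, x_{2n}\}$. Thus

$$\Pr\big\{\|P_n^0\|_{\mathcal C} > \varepsilon \,\big|\, \{x_j\}_{j=1}^{2n}\big\} \le 2\Delta^{\mathcal C}_{0,2n}\exp(-n\varepsilon^2/2).$$

Hence, for $\varepsilon \le 1/8$ and $n \ge n_0$ large enough so that $\exp(-n\varepsilon^2/4) < \varepsilon$, we
have

$$\Pr(\|P_n^0\|_{\mathcal C} > \varepsilon) \le \varepsilon + 2\exp(2n\varepsilon^3 - n\varepsilon^2/2) \le \varepsilon + 2\exp(-n\varepsilon^2/4) < 3\varepsilon,$$

so (c) implies (b).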

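Now it will be shown that (b) implies (c). By (6.9) we have $1 \ge
n^{-1}\log\Delta^{\mathcal C}_{0n} \to c$ a.s. for some constant $c := c_2 \ge 0$. Thus $n^{-1}E\log\Delta^{\mathcal C}_{0n} \to c$
and we want to prove $c = 0$. Suppose $c > 0$. Given $\varepsilon > 0$, for $n$ large enough

$$\Pr\{(2n)^{-1}\log\Delta^{\mathcal C}_{0,2n} > c/2\} = \Pr\{\Delta^{\mathcal C}_{0,2n} > e^{nc}\} > 1 - \varepsilon.$$

Next, to symmetrize,

$$\Pr\{\|P_n' - P_n''\|_{\mathcal C} > 2\varepsilon\} \le \Pr\{\|P_n' - P\|_{\mathcal C} > \varepsilon\} + \Pr\{\|P_n'' - P\|_{\mathcal C} > \varepsilon\} = 2\Pr\{\|P_n - P\|_{\mathcal C} > \varepsilon\}.$$

So it will suffice to prove that $\|P_n' - P_n''\|_{\mathcal C} \not\to 0$ in probability. If $2 \le k := [\alpha n]$
where $[x]$ is the largest integer $\le x$ and $0 < \alpha < 1/2$, then $\alpha n < k + 1 \le 3k/2$,
so by Proposition 4.3 and Stirling's formula (Theorem 1.17) we have

$${}_{2n}C_{\le k} \le (2ne/k)^k \le (3e/\alpha)^{\alpha n}.$$

As $\alpha \downarrow 0$, $(3e/\alpha)^{\alpha} \to 1$. Thus for $\alpha$ small enough, $(3e/\alpha)^{\alpha} < e^c$. Choose and
fix such an $\alpha > 0$. Then for $n$ large enough,

$$n \ge 2/\alpha \quad\text{and}\quad (3e/\alpha)^{\alpha n} < e^{nc}. \tag{6.10}$$

Hence by Sauer's theorem 4.2, if $\Delta^{\mathcal C}_{0,2n} \ge e^{nc}$, then $k^{\mathcal C}_{0,2n} \ge [\alpha n]$. Fix $n$ satisfying
(6.10). Let $k := [\alpha n]$.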
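The chain of bounds on the number of subsets of a $2n$-element set with at most $k = [\alpha n]$ elements can be checked numerically. In this sketch `binom_sum` plays the role of ${}_{2n}C_{\le k}$; the function names and the tested values of $n$ and $\alpha$ are hypothetical.

```python
import math

def binom_sum(N, k):
    """_N C_{<=k}: number of subsets of an N-element set with at most k elements."""
    return sum(math.comb(N, j) for j in range(k + 1))

def bounds(n, alpha):
    """Return (_{2n}C_{<=k}, (2ne/k)^k, (3e/alpha)^(alpha n)) for k = [alpha n]."""
    k = math.floor(alpha * n)  # the text requires k >= 2 and 0 < alpha < 1/2
    lhs = binom_sum(2 * n, k)
    mid = (2 * n * math.e / k) ** k
    rhs = (3 * math.e / alpha) ** (alpha * n)
    return lhs, mid, rhs

for n in (8, 40, 100, 400):
    lhs, mid, rhs = bounds(n, 0.25)
    print(n, lhs <= mid <= rhs)
```

Since $(3e/\alpha)^{\alpha} \to 1$ as $\alpha \downarrow 0$, the last bound grows like $e^{n\,o(1)}$, which is what makes it eventually smaller than $e^{nc}$ for any fixed $c > 0$.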

Now on an event $U$ with $\Pr(U) > 1 - \varepsilon$, there is a subset $T$ of the indices
$\{1, \dots, 2n\}$ such that card $T = k$, $\mathcal C$ shatters $\{x_i : i \in T\}$, and $x_i \ne x_j$ for $i \ne j$
in $T$. If there is more than one such $T$, select each of the possible $T$'s with
equal probability, using for this a random variable $Y$ independent of $x_j$ and $\sigma(j)$,
$1 \le j \le 2n$. Then since $x_j$ are i.i.d., $T$ is uniformly distributed over its $\binom{2n}{k}$
possible values. For any distinct $j_i \in \{1, \dots, 2n\}$, $N := 2n$, we have, where
the following equations are conditional on $U$,

$$\Pr(j_1 \in T) = k/N, \qquad \Pr(j_1, j_2 \in T) = \frac{k(k-1)}{N(N-1)}, \qquad \Pr(j_i \in T,\ i = 1, 2, 3, 4) = \binom{k}{4}\Big/\binom{N}{4}. \tag{6.11}$$

Let $M_n$ be the number of values of $j \le n$ such that both $2j - 1$ and $2j$ are in
$T$. Then from (6.11),

$$EM_n = \frac{k(k-1)}{2(N-1)} = \alpha^2 n/4 + O(1), \quad n \to \infty;$$

$$EM_n^2 = \frac{k(k-1)}{2(N-1)} + n(n-1)\binom{k}{4}\Big/\binom{N}{4},$$

and a bit of algebra gives $\sigma^2(M_n) = EM_n^2 - (EM_n)^2 = O(n)$ as $n \to \infty$. Thus
for $0 < \delta < \alpha^2/4$, by Chebyshev's inequality, $\Pr(M_n \ge \delta n) > 1 - 2\varepsilon$ for $n$
large. On the event $U \cap \{M_n \ge \delta n\}$, let's make a measurable selection, Theorem
5.22, of a sequence $J$ of $[\delta n]$ values of $i$ such that $J' := \bigcup_{i\in J}\{2i - 1, 2i\} \subset
T$. Here $M_n$ and $J$ are independent of the $\sigma(j)$. Now measurably select, by
Theorem 5.22 again with $y(\cdot) = H(\cdot)$, a set $A = A(\omega) = T(y(\omega)) \in \mathcal C$ such
that $\{j \in J' : x_j \in A\} = \{\sigma(i) : i \in J\}$. Then $\sum_{i\in J}(\delta_{x_{\sigma(i)}} - \delta_{x_{\tau(i)}})(A) = [n\delta]$.
Here $y(\cdot)$ is measurable for the $\sigma$-algebra $\mathcal B_J$ generated by all the $x_j$, by $Y$, and
by $\sigma(i)$ for $i \in J$. Conditional on $\mathcal B_J$,

$$\sum_{i\notin J}(\delta_{x_{\sigma(i)}} - \delta_{x_{\tau(i)}})(A) = \sum_{i\notin J} s_i a_i$$

where $a_i$ are $\mathcal B_J$-measurable functions with values $-1$, 0, and 1, and $s_i$ have
values $\pm 1$ with probability 1/2 each, independently of each other and of $\mathcal B_J$.
Thus by Chebyshev's inequality

$$\Pr\Big\{\Big|\sum_{i\notin J} s_i a_i\Big| > n\delta/3\Big\} \le 9/(n\delta^2)$$

on the event where $J$ is defined. Thus for $n$ large

$$\Pr\big((P_n' - P_n'')(A(\omega)) > \delta/3\big) > 1 - 3\varepsilon$$

and $\|P_n' - P_n''\|_{\mathcal C} \not\to 0$ in probability. So Theorem 6.10 is proved.
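The moment computations for $M_n$ in this proof can be verified exactly for small parameters by enumerating all $k$-element subsets $T$ of $\{1, \dots, 2n\}$. The sketch assumes only that $T$ is uniformly distributed over $k$-subsets; the function name and the chosen values $n = 4$, $k = 4$ are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def pair_count_moments(n, k):
    """Exact E[M_n] and E[M_n^2] when T is a uniform random k-subset of {1..2n}.

    M_n counts the j <= n with both 2j-1 and 2j in T.
    """
    N = 2 * n
    total = first = second = 0
    for T in combinations(range(1, N + 1), k):
        s = set(T)
        m = sum(1 for j in range(1, n + 1) if 2 * j - 1 in s and 2 * j in s)
        total += 1
        first += m
        second += m * m
    return Fraction(first, total), Fraction(second, total)

n, k = 4, 4
N = 2 * n
em, em2 = pair_count_moments(n, k)
print(em == Fraction(k * (k - 1), 2 * (N - 1)))                        # first moment
print(em2 == em + Fraction(n * (n - 1) * comb(k, 4), comb(N, 4)))      # second moment
```

The second moment splits into the diagonal terms, which sum to $EM_n$, and the $n(n-1)$ ordered pairs of distinct $j$'s, each requiring four given indices to lie in $T$.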


Theorem 6.14 In (6.9), if any $c_i$ is 0, all three are 0.


Proof. In the last proof we saw that $c_1 = 0$ if and only if $c_2 = 0$ and just
after (6.10) that if $c_2 > 0$ then $c_3 > 0$. On the other hand if $c_3 > 0$, then
for some $\delta > 0$, $k^{\mathcal C}_{0n} \ge \delta n$ for $n$ large enough a.s., and then $\Delta^{\mathcal C}_{0n} \ge 2^{\delta n}$. Thus
$c_2 \ge c_3\log 2 > 0$.


6.3 Pollard’s Central Limit Theorem 


By way of the Koltchinskii—Pollard kind of entropy and law of large numbers 
(Section 6.1 above) the following will be proved: 
Theorem 6.15 (Pollard) Let $(X, \mathcal A, P)$ be a probability space and let $\mathcal F \subset
L^2(X, \mathcal A, P)$. Let $\mathcal F$ be image admissible Suslin via $(Y, \mathcal S, T)$ and have an
envelope function $F \in L^2(X, \mathcal A, P)$. Suppose that

$$\int_0^1 \big(\log D_F^{(2)}(x, \mathcal F)\big)^{1/2}\,dx < \infty. \tag{6.12}$$

Then $\mathcal F$ is a Donsker class for $P$.
Before the proof of the theorem, here is a consequence: 


Theorem 6.16 (Jain and Marcus) Let $(K, d)$ be a compact metric space. Let
$C(K)$ be the space of continuous real functions on $K$ with supremum norm.
Let $X_1, X_2, \dots$ be i.i.d. random variables in $C(K)$. Suppose $EX_1(t) = 0$ and
$EX_1(t)^2 < \infty$ for all $t \in K$. Assume that for some random variable $M$ with
$EM^2 < \infty$,

$$|X_1(s) - X_1(t)|(\omega) \le M(\omega)d(s, t) \quad\text{for all } \omega \text{ and } s, t \in K.$$

Suppose that

$$\int_0^1 (\log D(\varepsilon, K, d))^{1/2}\,d\varepsilon < \infty. \tag{6.13}$$

Then the central limit theorem holds, i.e., in $C(K)$, $\mathcal L(n^{-1/2}(X_1 + \cdots + X_n))$
converges to some Gaussian law.


Proof. For a real-valued function $h$ on $K$ recall (as in Section 3.6)

$$\|h\|_L := \sup_{s\ne t}|h(s) - h(t)|/d(s, t), \qquad \|h\|_{\sup} := \sup_t |h(t)|,$$

$$\|h\|_{BL} := \|h\|_L + \|h\|_{\sup}, \qquad BL(K) := \{h \in C(K) : \|h\|_{BL} < \infty\}.$$

To apply Theorem 6.15, take as probability space $X = BL(K)$, $\mathcal A$ = $\sigma$-algebra
induced by the Borel sets of $\|\cdot\|_{\sup}$ (or equivalently evaluations at points of
$K$). Let $P = \mathcal L(X_1)$, $F(h) := \|h\|_{BL}$, $h \in X$. Then for any $s \in K$,

$$E\|X_1\|^2_{\sup} \le 2E\Big(X_1(s)^2 + M(\omega)^2\sup_t d(s, t)^2\Big) < \infty$$

and $E\|X_1\|^2_L \le EM(\omega)^2 < \infty$, so $EF^2 < \infty$ (note that $F$ is measurable,
$F \in L^2(X, \mathcal A, P)$).

Let $\mathcal G$ be the collection of functions $\delta_t/F : h \mapsto h(t)/F(h)$, $h \in X$, where
we replace $h(t)/F(h)$ by 0 if $h = 0$ in $BL(K)$, and $t$ runs through $K$. Then
$|g(h)| \le 1$ for all $g \in \mathcal G$ and $h \in X$. Let

$$\mathcal F := \{Fg : g \in \mathcal G\} = \{\delta_t := (h \mapsto h(t)) : t \in K\}.$$

For any $s, t \in K$ and $h \in X$, $|(\delta_s/F - \delta_t/F)(h)| \le d(s, t)$. Then by Proposition
6.3, for $0 < x \le 1$,

$$D_F^{(2)}(x, \mathcal F) \le D(x, \mathcal G, d_{\sup}) \le D(x, K, d).$$

Thus (6.12) holds and Theorem 6.15 applies to give that $\mathcal F$ is a Donsker
class. Since $\mathcal F$ is the set of evaluations at points of $K$, uniform convergence
over $\mathcal F$ (as in the definition of Donsker class) implies uniform convergence
of functions on $K$. Since $BL(K) \subset C(K)$, which is complete for uniform
convergence, the limiting Gaussian process $G_P$ for our Donsker class must
also have sample functions in $C(K)$ (almost surely). Since $C(K)$ is separable,
the laws $\mathcal L(n^{-1/2}(X_1 + \cdots + X_n))$ are defined on all Borel sets of $C(K)$ and
converge to $\mathcal L(G_P)$.


Remark. In the situation of Theorem 6.16, $K$ may be given originally with a
metric $e$. The metric $d$ may be chosen, perhaps as a function $d = f(e)$, where
$f(x)$ may approach 0 slowly as $x \downarrow 0$, e.g., $f(x) = x^{\varepsilon}$ for $\varepsilon > 0$ or $f(x) =
1/\max(|\log x|, 2)$. Thus one can increase the possibilities for obtaining the
Lipschitz property of $X_1$ with respect to $d$, so long as (6.13) holds for $d$.


Now to prove Theorem 6.15 we first have: 


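Lemma 6.17 Let $(X, \mathcal A, P)$ be a probability space, $F \in L^2(X, \mathcal A, P)$ and $\mathcal F \subset
L^2(X, \mathcal A, P)$ having $F$ as an envelope function. Let $H := 4F^2$ and $\mathcal H :=
\{(f - g)^2 : f, g \in \mathcal F\}$. Then $0 \le \varphi(x) \le H(x)$ for all $\varphi \in \mathcal H$ and $x \in X$, and
for any $\delta > 0$,

$$D_H^{(1)}(4\delta, \mathcal H) \le D_F^{(2)}(\delta, \mathcal F)^2.$$

Proof. Clearly $0 \le \varphi \le H$ for $\varphi \in \mathcal H$. Given any $\gamma \in \Gamma$, choose $m \le D_F^{(2)}(\delta,
\mathcal F)$ and $f_1, \dots, f_m \in \mathcal F$ such that (6.1) holds with $p = 2$. For any $f, g \in \mathcal F$,
take $i$ and $j$ such that

$$\max\big(\gamma((f - f_i)^2),\ \gamma((g - f_j)^2)\big) \le \delta^2\gamma(F^2).$$

Then by the Cauchy–Bunyakovsky–Schwarz inequality,

$$\gamma\big(|(f - g)^2 - (f_i - f_j)^2|\big) = \gamma\big(|f - g - f_i + f_j|\cdot|f - g + f_i - f_j|\big) \le \big[\gamma\big((f - f_i - (g - f_j))^2\big)\big]^{1/2}\big[\gamma(16F^2)\big]^{1/2} \le 8\delta\gamma(F^2) = 2\delta\gamma(H).$$

Thus letting $h_{k(i,j)} := (f_i - f_j)^2$ where $k(i, j) := mi - m + j$, $i, j = 1, \dots, m$,
we get an approximation of all functions in $\mathcal H$, in the $L^1(\gamma)$ norm, within
$2\delta\gamma(H)$, by functions $h_k$, $k = 1, \dots, m^2$, which implies the Lemma.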
Lemma 6.17 gives in particular that if $D_F^{(2)}(\delta, \mathcal F) < \infty$ for all $\delta > 0$, then
$D_H^{(1)}(\varepsilon, \mathcal H) < \infty$ for all $\varepsilon > 0$. Thus hypothesis (6.12) lets us apply Theorem
6.7, with $\mathcal F$ there $= \mathcal H$.


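Proposition 6.18 Let $(X, \mathcal A, P)$ be a probability space where $(X, \mathcal A)$ is separable
and $\mathcal F$ an image admissible Suslin class of measurable real-valued functions
on $X$ with an envelope function $F \in L^2(X, \mathcal A, P)$. If $D_F^{(2)}(\beta, \mathcal F) < \infty$ for all
$\beta > 0$, then $\mathcal F$ is totally bounded in $L^2(X, \mathcal A, P)$.


Proof. We may assume $\int F^2\,dP > 0$, as otherwise $\mathcal F$ is quite totally bounded.
The class $\{(f - g)^2 : f, g \in \mathcal F\}$ is image admissible Suslin by Theorem 5.26.
By Lemma 6.17 and Theorem 6.7 we have

$$\sup\{|(P_n - P)((f - g)^2)| : f, g \in \mathcal F\} \to 0 \quad\text{a.s.},\ n \to \infty. \tag{6.14}$$

Also, as $n \to \infty$, $\int F^2\,dP_n \to \int F^2\,dP$ a.s. Given $\varepsilon > 0$ take $n_1$ large enough
and a value of $P_n$, $n \ge n_1$, such that $\int F^2\,dP_n < 2\int F^2\,dP$ and

$$\sup\{|(P_n - P)((f - g)^2)| : f, g \in \mathcal F\} < \varepsilon/2.$$

Take $0 < \beta < (\varepsilon/(4P(F^2)))^{1/2}$ and choose $f_1, \dots, f_m \in \mathcal F$ to satisfy (6.1) for
$\delta = \beta$, $p = 2$, and $\gamma = P_n$. Then for each $f \in \mathcal F$ we have for some $j$

$$\int (f - f_j)^2\,dP < \frac{\varepsilon}{2} + \int (f - f_j)^2\,dP_n \le \frac{\varepsilon}{2} + \beta^2\int F^2\,dP_n < \varepsilon.$$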


To continue the proof of Theorem 6.15, assume as we may that $P(F^2) > 0$.
Consider the $L^2$ pseudometric $e_P(f, g) := (P((f - g)^2))^{1/2}$ for $f, g \in L^2(P)$.
In Theorem 3.34(II) for $\tau = e_P$, we have proved the total boundedness. It
remains to check the asymptotic equicontinuity condition. Let $0 < \varepsilon < 1$. For
any $\delta$ with $0 < \delta < \varepsilon/2$, let

$$\mathcal F(\delta) := \{f - g : f, g \in \mathcal F,\ e_P(f, g) < \delta\}.$$

Then $\mathcal F(\delta)$ is image admissible Suslin for each $\delta > 0$ via some $(V, \mathcal V, \tau)$ by
Theorem 5.26 and Corollary 5.18. It will be enough to show that for $\delta$ small
enough and $n_0$ large enough, and any $n \ge n_0$,

$$\Pr(\|\nu_n\|_{\mathcal F(\delta)} > 6\varepsilon) < 5\varepsilon. \tag{6.15}$$

By the symmetrization lemma 6.5 applied to $\mathcal F(\delta)$, $\eta = 3\varepsilon$, and $\zeta = \delta < \varepsilon/2$,
we have

$$\Pr(\|\nu_n\|_{\mathcal F(\delta)} > 6\varepsilon) \le \frac{36}{35}\Pr(\|\nu_n^0\|_{\mathcal F(\delta)} > 3\varepsilon). \tag{6.16}$$

Thus to prove (6.15), given $\varepsilon > 0$, it will suffice to find a $\delta > 0$ small enough
so that for $n \ge n_0$ large enough,

$$\Pr(\|\nu_n^0\|_{\mathcal F(\delta)} > 3\varepsilon) < 4\varepsilon. \tag{6.17}$$
A suitable value of $\delta$ will be found only late in the proof, in (6.30). For
$x = \{x_j\}_{j=1}^{2n} \in X^{2n}$, $\omega := \{e_j\}_{j=1}^n \in E := \{-1, 1\}^n$, and $y \in V$, $(x, \omega, y) \mapsto
\nu_n^0(\tau(y))$ is jointly measurable by Proposition 5.20. It follows that $(x, \omega) \mapsto
\|\nu_n^0\|_{\mathcal F(\delta)}$ is jointly completion measurable by Corollary 5.25. Thus the event
$A = A_{n,\delta,\varepsilon}$ in (6.17) is jointly completion measurable. For any real-valued function
$g$ on $X$ let $\|g\|_{2n} := (P_{2n}(g^2))^{1/2}$. Let $B_n := \{\|F\|^2_{2n} \le 2P(F^2)\}$. Then

$$\Pr(B_n) > 1 - \varepsilon \quad\text{for } n \ge n_2 = n_2(\varepsilon) \tag{6.18}$$

for some $n_2$ not depending on $\delta$. By (6.14), which applies, we have given $\delta > 0$
and $\varepsilon > 0$ for $n \ge n_1(\delta, \varepsilon)$ large enough

$$\sup\{|(P_{2n} - P)((f - g)^2)| : f, g \in \mathcal F\} \le \delta^2 \tag{6.19}$$

on an event $C_n = C_n(\delta, \varepsilon)$ with $\Pr(C_n) > 1 - \varepsilon$. On $C_n$, we will have for all
$f, g \in \mathcal F$ with $f - g \in \mathcal F(\delta)$,

$$P_{2n}((f - g)^2) \le 2\delta^2. \tag{6.20}$$

Let $X' := X'_n := B_n \cap C_n$. Similarly as in (6.4), we have

$$\Pr(A_{n,\delta,\varepsilon}) \le 2\varepsilon + \int_{X'}\int_E 1_A(x, \omega)\,dE_n(\omega)\,dP^{2n}(x). \tag{6.21}$$

Again an upper bound for the inner integral (in fact, a finite sum over $2^n$ points)

$$\int_E 1_A(x, \omega)\,dE_n(\omega) \tag{6.22}$$

will be sought. In doing this, we have a fixed $x \in X'_n$. Some choices depending
on $x$ will be made. These need not be made as measurable functions of $x$, as
long as the eventual upper bound for (6.22) is measurable in $x$. Let $\delta_i := 2^{-i}$,
$i = 1, 2, \dots$. For our fixed $x$, choose finite subsets $\mathcal F(1, x), \mathcal F(2, x), \dots$ of $\mathcal F$
such that for all $i$, and $f \in \mathcal F$,

$$\min\{\|f - g\|_{2n} : g \in \mathcal F(i, x)\} \le \delta_i\|F\|_{2n} \tag{6.23}$$

with $k(i, x) := \operatorname{card}(\mathcal F(i, x)) \le D_F^{(2)}(\delta_i, \mathcal F)$. We can write

$$\mathcal F(i, x) = \{g_{i1}, \dots, g_{ik(i,x)}\}.$$

For each $f \in \mathcal F$, let $f_i := g := g_{im} \in \mathcal F(i, x)$ achieve the minimum in
(6.23), with $m$ minimal in case of a tie. Now $\|f_i - f\|_{2n} \to 0$ as $i \to \infty$
by (6.23), and for any fixed $r$, $f - f_r = \sum_{j>r}(f_j - f_{j-1})$ pointwise on
$S := \{x_1, \dots, x_{2n}\}$.
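Let $H_j := \log D_F^{(2)}(\delta_j, \mathcal F)$, $j = 1, 2, \dots$. The integral condition (6.12) is
equivalent to

$$\sum_{j=1}^\infty \delta_j H_j^{1/2} < \infty. \tag{6.24}$$

For all $x$ and $j$, card $\mathcal F(j, x) \le \exp(H_j)$. Let

$$\eta_j := \max\big(j\delta_j,\ (576P(F^2)\delta_j^2 H_j)^{1/2}\big) > 0.$$

Then

$$\sum_{j\ge 1}\eta_j < \infty, \tag{6.25}$$

$$\eta_j^2 \ge 576P(F^2)\delta_j^2 H_j, \quad\text{and} \tag{6.26}$$

$$\sum_{j\ge 1}\exp\big(-\eta_j^2/(288\delta_j^2 P(F^2))\big) \le \sum_{j\ge 1}\exp\big(-j^2/(288P(F^2))\big) < \infty. \tag{6.27}$$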
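To see how the chaining thresholds $\eta_j$ behave, one can plug in a hypothetical polynomial entropy bound, $D_F^{(2)}(\delta, \mathcal F) = A\delta^{-w}$, so that $H_j = \log A + wj\log 2$ for $\delta_j = 2^{-j}$. The constants `A`, `w`, and the value of $P(F^2)$ below are assumptions made only for this sketch.

```python
import math

def eta(j, H_j, PF2):
    """eta_j = max(j*delta_j, (576*P(F^2)*delta_j^2*H_j)^(1/2)) with delta_j = 2^-j."""
    d = 2.0 ** (-j)
    return max(j * d, math.sqrt(576 * PF2 * d * d * H_j))

def poly_entropy(j, A=2.0, w=3.0):
    """Hypothetical H_j = log(A * delta_j^(-w)) for delta_j = 2^-j."""
    return math.log(A) + w * j * math.log(2)

PF2 = 1.0  # assumed value of P(F^2)
partial = 0.0
for j in range(1, 41):
    partial += eta(j, poly_entropy(j), PF2)
print(partial)  # partial sums of (6.25) stabilise quickly
```

Because $\eta_j$ is of order $2^{-j}j^{1/2}$ in this case, the series in (6.25) converges geometrically, and the first term $j\delta_j$ in the maximum is what yields the comparison series in (6.27).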


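It follows that, with $\Pr_x$ denoting probability with respect to $E_n(\omega)$ for our fixed
$x \in X'_n \subset X^{2n}$,

$$\Pr\nolimits_x\Big\{\sup_{f\in\mathcal F}|\nu_n^0(f - f_r)| > \sum_{j>r}\eta_j\Big\} \le \sum_{j>r}\Pr\nolimits_x\Big\{\sup_{f\in\mathcal F}|\nu_n^0(f_j - f_{j-1})| > \eta_j\Big\} \le \sum_{j>r}\exp(H_j)\exp(H_{j-1})\sup_{f\in\mathcal F}\Pr\nolimits_x\{|\nu_n^0(f_j - f_{j-1})| > \eta_j\}, \tag{6.28}$$

since there are $\exp(H_i)$ possibilities for $f_i$, $i = j - 1, j$. For a fixed $j$ and $f$ let

$$z_i := (f_j - f_{j-1})(x_{2i}) - (f_j - f_{j-1})(x_{2i-1}).$$

Then by (6.3),

$$\nu_n^0(f_j - f_{j-1}) = n^{-1/2}\sum_{i=1}^n e_i z_i$$

for the Rademacher variables $e_i$. By an inequality of Hoeffding (Proposition
1.12 above)

$$\Pr\nolimits_x\Big\{\Big|n^{-1/2}\sum_{i=1}^n e_i z_i\Big| > \eta_j\Big\} \le 2\exp\Big(-\tfrac12 n\eta_j^2\Big/\sum_{i=1}^n z_i^2\Big).$$

On $B_n$, we have

$$\sum_{i=1}^n z_i^2 \le 4n\|f_j - f_{j-1}\|^2_{2n} \le 4n\big(\|f_j - f\|_{2n} + \|f - f_{j-1}\|_{2n}\big)^2 \le 4n\|F\|^2_{2n}(\delta_j + \delta_{j-1})^2 \le 72n\delta_j^2 P(F^2)$$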
by (6.23) and the few lines after it. Then the last sum in (6.28) is less than

$$\sum_{j>r}\exp(2H_j)\,2\exp\big(-\eta_j^2/(144\delta_j^2 P(F^2))\big) \le 2\sum_{j>r}\exp\big(-\eta_j^2/(288\delta_j^2 P(F^2))\big) \tag{6.29}$$

by (6.26). There will be four conditions for $r$ to be sufficiently large, expressed
as $r \ge R_i$ for positive integers $R_i$. None of these conditions will depend on
$n$. The first is that $r \ge R_1$ large enough so that the expression in (6.29) is
$< \varepsilon$, which exists by (6.27). The second is that $r \ge R_2 \ge R_1$ where $R_2$ is
large enough so that $\sum_{j>r}\eta_j < \varepsilon$, which exists by (6.25). The third is that
$r \ge R_3 \ge R_2$ for $R_3$ large enough so that $\varepsilon^2 \ge (256H_r + 1)\delta_r^2 P(F^2)$, which
exists since $H_r\delta_r^2 \to 0$ as $r \to \infty$ by (6.24). The fourth is that $r \ge R_4 \ge R_3$
large enough so that $2\exp(-\varepsilon^2/(128\delta_r^2 P(F^2))) \le \varepsilon$. Now let for any $r \ge R_4$,
specifically $r = R_4$,

$$\delta := \delta_r P(F^2)^{1/2}/2. \tag{6.30}$$

Then $r \ge R_3$ implies that $\delta^2 = \delta_r^2 P(F^2)/4 < \varepsilon^2/4$ and so $\delta < \varepsilon/2$ as needed
before (6.16). Take any $n \ge \max(n_1(\delta, \varepsilon), n_2(\varepsilon))$ for $n_1(\delta, \varepsilon)$ as just before
(6.19) and $n_2(\varepsilon)$ as in (6.18). We will have the events $B_n$ and $C_n = C_n(\delta, \varepsilon)$
such that $\Pr(B_n) > 1 - \varepsilon$ and $\Pr(C_n) > 1 - \varepsilon$. Thus defining $X'_n := B_n \cap C_n$
we have

$$\Pr(X'_n) > 1 - 2\varepsilon. \tag{6.31}$$


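We obtain almost surely on $X'_n$ that, by (6.28) and (6.29) and since
$r \ge R_2 \ge R_1$,

$$\Pr\nolimits_x\Big\{\sup_{f\in\mathcal F}|\nu_n^0(f - f_r)| > \varepsilon\Big\} \le \varepsilon. \tag{6.32}$$

Next, almost surely on $X'_n$, by (6.20) we have for all $f, g \in \mathcal F$ with
$f - g \in \mathcal F(\delta)$, $\|f - g\|^2_{2n} \le 2\delta^2 = \delta_r^2 P(F^2)/2$, so

$$\|f - g\|_{2n} \le \delta_r\big(P(F^2)/2\big)^{1/2} < \delta_r P(F^2)^{1/2}.$$

It follows that on $X'_n$, using (6.23),

$$\|f_r - g_r\|_{2n} \le \|f_r - f\|_{2n} + \|f - g\|_{2n} + \|g - g_r\|_{2n} \le \delta_r P(F^2)^{1/2} + 2\delta_r\|F\|_{2n} \le 4\delta_r(P(F^2))^{1/2}. \tag{6.33}$$

Then by the same Hoeffding inequality Proposition 1.12, on $X'_n$,

$$\Pr\nolimits_x\big\{\sup\{|\nu_n^0(f_r - g_r)| : f, g \in \mathcal F,\ f - g \in \mathcal F(\delta)\} > \varepsilon\big\} \le (\operatorname{card}\mathcal F(r, x))^2\cdot 2\exp\big(-\varepsilon^2/[64\delta_r^2 P(F^2)]\big) \le 2\exp\big(2H_r - \varepsilon^2/(64\delta_r^2 P(F^2))\big) \tag{6.34}$$

$$\le 2\exp\big(-\varepsilon^2/(128\delta_r^2 P(F^2))\big) \quad(\text{with } \varepsilon^2 \ge 256H_r\delta_r^2 P(F^2)) \quad \le \varepsilon$$

since $r \ge R_4 \ge R_3$. We have

$$|\nu_n^0(f - g)| \le |\nu_n^0(f - f_r)| + |\nu_n^0(f_r - g_r)| + |\nu_n^0(g_r - g)|.$$

Combining (6.32) and (6.34) we get that for $x \in X'_n$,

$$\Pr\nolimits_x\big(\sup\{|\nu_n^0(h)| : h \in \mathcal F(\delta)\} > 3\varepsilon\big) \le 2\varepsilon.$$

So for $x \in X'_n$, the integral in (6.22) is bounded above by $2\varepsilon$. Thus in (6.21),
$\Pr(A_{n,\delta,\varepsilon}) \le 4\varepsilon$, i.e. (6.17), and so (6.15) holds, proving Theorem 6.15.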


Corollary 6.19 (Pollard) Let $(X, \mathcal A, P)$ be a probability space, and let $\mathcal F$ be
an image admissible Suslin Vapnik–Červonenkis subgraph class of functions
with envelope $F \in L^2(X, \mathcal A, P)$. Then $\mathcal F$ is a Donsker class for $P$.


Proof. This follows from Theorem 6.15 and Theorem 4.52 for p = 2. 
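To see why a polynomial entropy bound is enough for (6.12), suppose, as a working assumption for this illustration only, that $D_F^{(2)}(x, \mathcal F) \le Ax^{-w}$ for $0 < x \le 1$ with constants $A \ge 1$ and $w > 0$ (these symbols are introduced here and are not the notation of Theorem 4.52). Using $(a + b)^{1/2} \le a^{1/2} + b^{1/2}$ and the substitution $x = e^{-u}$:

```latex
\int_0^1 \bigl(\log (A x^{-w})\bigr)^{1/2}\,dx
  \le (\log A)^{1/2} + w^{1/2}\int_0^1 \bigl(\log(1/x)\bigr)^{1/2}\,dx
  = (\log A)^{1/2} + w^{1/2}\int_0^\infty u^{1/2}e^{-u}\,du
  = (\log A)^{1/2} + w^{1/2}\,\Gamma(3/2)
  = (\log A)^{1/2} + \tfrac12 (\pi w)^{1/2} < \infty .
```

So under such a bound the entropy integral is finite, with an explicit value depending only on $A$ and $w$.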


Corollary 6.20 Let $(X, \mathcal A, P)$ be a probability space, $F \in L^2(X, \mathcal A, P)$,
and $\mathcal F = \{F1_C : C \in \mathcal C\}$ where $\mathcal C$ is an image admissible Suslin
Vapnik–Červonenkis class of sets. Then $\mathcal F$ is a Donsker class for $P$.


Proof. Since $F$ is measurable, the image admissible Suslin property of $\mathcal F$
follows from that of $\mathcal C$. By Theorem 6.1 for $p = 2$, (6.12) holds and Theorem
6.15 applies.


For VC major and hull classes we have the following: 


Theorem 6.21 Let $(X, \mathcal A)$ be a measurable space and $\mathcal C \subset \mathcal A$ a VC class of
sets. Assume $\mathcal C$ is image admissible Suslin. Let $\mathcal F$ be a VC hull class of functions
for $\mathcal C$, such as in particular a uniformly bounded VC major class for $\mathcal C$. Then
$\mathcal F$ is a Donsker class for every probability measure $P$ on $(X, \mathcal A)$.


Proof. A uniformly bounded VC major class is VC hull by Theorem 4.51.
The class $\mathcal C$, or in other words $\mathcal F = \{1_A : A \in \mathcal C\}$, is Donsker for every $P$ by
Corollary 6.20 with $F \equiv 1$. Then each $\bar H_s(\mathcal F, M)$ is also Donsker for every $P$
by Theorem 3.41, which gives the result.
One might ask whether in Corollary 6.19, "VC subgraph" can be replaced
by "VC major" for any $F \in L^2(P)$, as Corollary 6.20 might suggest. It cannot.
A somewhat stronger condition on the envelope $F$ is needed, as the following
will show.


Proposition 6.22 If a probability space $(X, \mathcal A, P)$ has a decomposition into disjoint
measurable sets $A_k$ such that

$$\sum_{k\ge 1} 2^k P(A_k)^{1/2} = +\infty, \tag{6.35}$$

then there exists a VC major class $\mathcal F$ with envelope $F := \sum_k 2^k 1_{A_k}$ which is
not $P$-pregaussian, even though $F$ may be in $L^2(P)$.

Proof. If $P(A_k) = C/(4^k k^a)$ where $1 < a < 2$ and $C = C_a$ is the suitable
normalizing constant, then it is easy to check that $F \in L^2(P)$ but (6.35) holds.
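The check can be written out as follows, reading the envelope as $F = \sum_k 2^k 1_{A_k}$ and condition (6.35) as $\sum_k 2^k P(A_k)^{1/2} = +\infty$ (both readings are assumptions of this worked computation), with $C$ chosen so that $\sum_k P(A_k) = 1$:

```latex
\int F^2\,dP = \sum_{k\ge 1} 4^k P(A_k) = C\sum_{k\ge 1} k^{-a} < \infty
  \quad\text{since } a > 1,
\qquad
\sum_{k\ge 1} 2^k P(A_k)^{1/2} = C^{1/2}\sum_{k\ge 1} k^{-a/2} = +\infty
  \quad\text{since } a/2 < 1 .
```

The factor $4^{-k}$ in $P(A_k)$ exactly cancels the growth of $F^2$, so square integrability and (6.35) are decided by the two $p$-series with exponents $a$ and $a/2$.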
Consider

$$\mathcal F := \Big\{\sum_{k=1}^M \Big(1 - \frac{\sigma_k}{3}\Big)2^k 1_{A_k} : \sigma_k = 0 \text{ or } 1 \text{ for } k \ge 1,\ M = 1, 2, \dots\Big\}.$$

Since $4/3 > 1$, the sets $\{x : f(x) > c\}$, $f \in \mathcal F$, $c \in \mathbb R$, are all of the form
$B_{N,M} := \bigcup_{k=N}^M A_k$. Thus they form a VC class $\mathcal C$ with $S(\mathcal C) \le 2$, since if
$x \in A_i$, $y \in A_j$ and $z \in A_k$ with $i \le j \le k$, then any set $B_{N,M}$ containing
$x$ and $z$ must also contain $y$. So $\mathcal F$ is a VC major class.


For any function $f$ let $f^+ := \max(f, 0)$, $f^- := -\min(f, 0)$. Let

$$T_N := \sup_{\sigma_k = 0,1}\ \sum_{k=1}^N (1 - \sigma_k/3)2^k G_P(A_k)$$

and $T := \sup_N T_N$. Then $E|T_1| < \infty$ (note also that $ET_1 > 0$) implies that $ET$
is well-defined (possibly $+\infty$), and for each $N$, $ET_N \le ET \le E\|G_P\|_{\mathcal F}$.
Now

$$ET_N = \sum_{k=1}^N 2^k EG_P(A_k)^+ - \frac23\sum_{k=1}^N 2^k EG_P(A_k)^- = \frac12\sum_{k=1}^N 2^k E|G_P(A_k)| - \frac13\sum_{k=1}^N 2^k E|G_P(A_k)| = \frac16\sum_{k=1}^N 2^k E|G_P(A_k)|$$

$$= \frac16\sum_{k=1}^N 2^k\big(P(A_k)(1 - P(A_k))\big)^{1/2}(2/\pi)^{1/2} \to +\infty$$

as $N \to \infty$ since the $A_k$ are disjoint, so $1 - P(A_k) \to 1$. So $\mathcal F$ is not
$P$-pregaussian, by Theorem 2.47(a) or by Lemma 2.10, since the set of all real
sequences $\{x_k\}_{k\ge 1}$ with the smallest $\sigma$-algebra making each $x_k$ measurable is a
measurable vector space.
Let $F$ be a measurable, finite-valued function on $(X, \mathcal A)$ with $F \ge 1$. Denote
by $\mathcal M_F$ the set of all functions of the form $\varphi \circ F$, where $\varphi : \mathbb R^+ \to \mathbb R^+$ is an
arbitrary nondecreasing function such that $\varphi(u) \le u$ for all $u \in \mathbb R^+$. Then $\mathcal M_F$
is a VC major class for the class $\mathcal C$ of sets $\{x : F(x) > c\}$ or $\{x : F(x) \ge c\}$ for
$1 \le c < \infty$, which are linearly ordered by inclusion, and so $S(\mathcal C) = 1$. Clearly
$\mathcal M_F$ has envelope $F$.

The Lorentz space $\mathcal L_{p,q}$ is defined (e.g., Ledoux and Talagrand 1986) as the
space of real random variables $\eta$ such that $\int_0^\infty (t^p\Pr(|\eta| > t))^{q/p}\,dt/t < \infty$.
Thus $\mathcal L_{2,1}$ is the space of the $\eta$ such that $\int_0^\infty \Pr(|\eta| > t)^{1/2}\,dt < \infty$. Dudley
and Koltchinskii (1994) proved the following, stated here without proof.


Proposition 6.23 For any image admissible Suslin VC major class $\mathcal F$ with
envelope $F$, including $\mathcal M_F$, $\mathcal F$ is a $P$-Donsker class if and only if $F \in \mathcal L_{2,1}(P)$.


In the proof it is shown that if $F \notin \mathcal L_{2,1}(P)$, then

$$\sum_{k=1}^\infty 2^k P(A_k)^{1/2} = +\infty$$

where $A_k := \{2^{k-1} < F \le 2^k\}$, $k \ge 1$.


6.4 Necessary Conditions for Limit Theorems 


Theorems 6.1 and 6.15 imply that every class $\mathcal C \subset \mathcal A$ with $S(\mathcal C) < +\infty$, and
which is image admissible Suslin, is a Donsker class, for an arbitrary law $P$ on
$\mathcal A$. In this section it will be shown that to obtain, for all $P$, such a central limit
theorem (or even the pregaussian property), for a class $\mathcal C$ of sets, the condition
$S(\mathcal C) < +\infty$ is necessary. Then it will be noted that some measurability, beyond
that of $\|P_n - P\|_{\mathcal C}$, is needed to obtain even a law of large numbers for $S(\mathcal C) <
+\infty$ (Corollary 6.8). Lastly, it will be shown that $S(\mathcal C) < \infty$ is necessary so
that $\|P_n - P\|_{\mathcal C} \to 0$ in outer probability as $n \to \infty$, uniformly in $P$.


Theorem 6.24 Let $(X, \mathcal A)$ be a measurable space and $\mathcal C \subset \mathcal A$. Then $S(\mathcal C) <
+\infty$ if and only if for all laws $P$ on $\mathcal A$, $\{1_A : A \in \mathcal C\}$ is a pregaussian class (as
defined in Section 3.1).


Proof. "Only if" was proved in Corollary 4.48. To prove "if," suppose $S(\mathcal C) =
+\infty$. Then for each $n \ge 1$, $\mathcal C$ shatters some set $F_n$ with card$(F_n) = 4^n$. Let
$G_n := F_n \setminus \bigcup_{j<n} F_j$. Then the sets $G_n$ are disjoint, card$(G_n) \ge 2^n$, and $\mathcal C$
shatters $G_n$. Take $E_n \subset G_n$ with card$(E_n) = 2^n$. Then the $E_n$ are disjoint and
shattered by $\mathcal C$. Some countable subset $\mathcal D \subset \mathcal C$ shatters every $E_n$. We have

$$\sum_{n=1}^\infty \frac{1}{n(n+1)} = \sum_{n=1}^\infty\Big(\frac1n - \frac{1}{n+1}\Big) = 1.$$

Let $P$ be the law on $\bigcup_{n=1}^\infty E_n$ with $P(\{x\}) = 1/[2^n n(n+1)]$ for each $x \in E_n$.

Given $n$, for the isonormal process $W_P$ on $L^2(P)$, for each $C \in \mathcal D$, $W_P(C) =
W_P(C \cap E_n) + W_P(C \setminus E_n)$. For $0 < K < \infty$, define the events

$$\mathcal E_1 := \{|W_P(B)| < 2K \text{ for all } B \subset E_n\},$$

$$\mathcal E_2 := \{|W_P(B)| \ge 2K \text{ for some } B \subset E_n, \text{ and for all such } B \text{ and all } C \in \mathcal D \text{ with } C \cap E_n = B,\ |W_P(C \setminus E_n)| \ge K\}.$$

Then $\{|W_P(C)| < K$ for all $C \in \mathcal D\} \subset \mathcal E_1 \cup \mathcal E_2$. Let $S_n := \sum_{x\in E_n}|W_P(\{x\})|$.
Then

$$\sup\{|W_P(B)| : B \subset E_n\} \ge S_n/2,$$


so $\mathcal E_1 \subset \{S_n < 4K\}$. For each $x \in E_n$, $W_P(\{x\})$ has a Gaussian law with
mean 0 and variance $1/(n(n+1)2^n) =: \sigma_n^2$, so $E|W_P(\{x\})| = (2/\pi)^{1/2}\sigma_n$ and
Var$(|W_P(\{x\})|) = \sigma_n^2(1 - 2/\pi)$. Thus $ES_n = 2^n(2/\pi)^{1/2}\sigma_n$. Since $W_P$ has
independent values on disjoint sets, Var$(S_n) = (1 - 2/\pi)/[n(n+1)]$. For $n$ large,
$ES_n > 4K$, and then by Chebyshev's inequality

$$\Pr\{S_n < 4K\} \le \Pr\{|S_n - ES_n| > ES_n - 4K\} \le \frac{1}{n^2(ES_n - 4K)^2} = n^{-2}\Big[\Big(\frac{2^{n+1}}{\pi n(n+1)}\Big)^{1/2} - 4K\Big]^{-2} =: f(n, K) \to 0 \text{ as } n \to \infty.$$

Turning to $\mathcal E_2$, let $t(n) := 2^{2^n}$. Let the subsets of $E_n$ be $B_1, \dots, B_{t(n)}$. Let
$M_0 := \emptyset$ and recursively for $j \ge 1$, $M_j := \{|W_P(B_j)| \ge 2K\} \setminus \bigcup_{0\le i<j} M_i$.
Let $D_j := M_j \cap A_{j,\mathcal D,K}$ where $A_{j,\mathcal D,K}$ is the event that for all $C \in \mathcal D$ such that
$C \cap E_n = B_j$, we have $|W_P(C \setminus E_n)| \ge K$.

For any set $A$, $W_P(A)$ has a Gaussian distribution with mean 0 and variance
$P(A)$. Let $\Phi(x) := (2\pi)^{-1/2}\int_{-\infty}^x \exp(-t^2/2)\,dt$. Then since $W_P$ has independent
values on disjoint sets,

$$\Pr(D_j) = \Pr(M_j \cap A_{j,\mathcal D,K}) \le \Pr(M_j)\cdot 2\Phi(-K).$$

Now $\mathcal E_2 \subset \bigcup_{1\le j\le t(n)} D_j$, so since the sets $M_j$ are disjoint,

$$\Pr(\mathcal E_2) \le \sum_{1\le j\le t(n)}\Pr(M_j)\cdot 2\Phi(-K) \le 2\Phi(-K).$$

Hence

$$\Pr(|W_P(C)| < K \text{ for all } C \in \mathcal D) \le f(n, K) + 2\Phi(-K).$$

If we let $n \to +\infty$, then $K \to +\infty$, we see that $W_P$ is a.s. unbounded on
$\mathcal D \subset \mathcal C$, and note that $G_P(\cdot)$ can be written as $W_P(\cdot) - P(\cdot)W_P(1)$, so $G_P$ is a.s.
unbounded on $\mathcal C$.
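The two terms in the final bound can be evaluated numerically to see the order of limits at work: for fixed $K$ the Chebyshev term tends to 0 as $n$ grows, and the Gaussian term $2\Phi(-K)$ tends to 0 as $K$ grows. The sketch below uses the expressions $ES_n = 2^n(2/\pi)^{1/2}\sigma_n$ and $f(n, K) = n^{-2}(ES_n - 4K)^{-2}$ as read from the proof; the function names are illustrative.

```python
import math

def expected_S(n):
    """E S_n = 2^n * (2/pi)^(1/2) * sigma_n with sigma_n^2 = 1/(n(n+1)2^n)."""
    sigma = math.sqrt(1.0 / (n * (n + 1) * 2 ** n))
    return 2 ** n * math.sqrt(2 / math.pi) * sigma

def f(n, K):
    """Chebyshev-type bound n^-2 (E S_n - 4K)^-2, meaningful once E S_n > 4K."""
    gap = expected_S(n) - 4 * K
    if gap <= 0:
        raise ValueError("need E S_n > 4K")
    return 1.0 / (n * n * gap * gap)

def bounded_prob_bound(n, K):
    """f(n, K) + 2*Phi(-K), using 2*Phi(-K) = erfc(K / sqrt(2))."""
    return f(n, K) + math.erfc(K / math.sqrt(2))

for K in (1, 3, 5):
    print(K, bounded_prob_bound(30, K))
```

Since $ES_n$ grows like $2^{n/2}/n$, any fixed $K$ is eventually dominated, which is why $n$ is sent to infinity first.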


Remark. Now recall the example at the beginning of Chapter 5, where $\mathcal C$ is
the collection of all countable initial segments of an uncountable well-ordered
set $(X, <)$ and $P$ is a continuous law on some $\sigma$-algebra $\mathcal A$ containing all
countable subsets of $X$. Then $S(\mathcal C) = 1$ but $\sup_{A\in\mathcal C}|(P_n - P)(A)| = 1$ for all
$n$. Thus the latter random variable is measurable. For this class the weak law
of large numbers, hence the strong law and central limit theorem, all fail as
badly as possible. This shows that in Theorem 6.7 and Corollary 6.8, the "image
admissible Suslin" condition cannot simply be removed, nor replaced by simple
measurability of random quantities appearing in the statements of the results.
Further, for all $A \in \mathcal C$, $1_A = 0$ a.s. $P$, so vanishing a.s. ($P$) even with $S(\mathcal C) = 1$
does not imply a law of large numbers.


Remark 6.25 If $X$ is a countably infinite set and $\mathcal A = 2^X$, then for an arbitrary
law $P$ on $\mathcal A$, $\lim_{n\to\infty}\sup_{A\in\mathcal A}|(P_n - P)(A)| = 0$ a.s. (see Problem 8 at the end
of this chapter). But $S(\mathcal A) = +\infty$, so the hypothesis of Theorem 6.24 cannot
be weakened to a law of large numbers for all $P$.


Next it will be seen that the Vapnik–Červonenkis property is also necessary
for a law of large numbers to hold uniformly in $P$ or that there exist an estimator
of $P$ based on $X_1, \dots, X_n$ (which might or might not equal $P_n$) converging to
$P$ uniformly over $\mathcal C$ and uniformly in $P$. Here are some definitions.

Let $(X, \mathcal B)$ be a measurable space. Let its $n$-fold Cartesian product be
$(X^n, \mathcal B^n)$. Let $\mathcal P$ be the class of all probability measures on $(X, \mathcal B)$. Let $\mathcal C \subset \mathcal B$
be any collection of measurable sets. A real-valued function $T_n$ on $X^n \times \mathcal C$ will
be called a $\mathcal C$-estimator if it is a stochastic process indexed by $\mathcal C$, in other words,
for each $A \in \mathcal C$, $x \mapsto T_n(x, A)$ is measurable on $X^n$. A $\mathcal C$-estimator $T_n$ will be
called an estimator if for each $x$, there is a probability measure $\mu$ on $\mathcal A$ which
equals $T_n(x, \cdot)$ on $\mathcal C$.

For any probability measure $P$ on $(X, \mathcal B)$ and product law $P^n$ on $(X^n, \mathcal B^n)$,
for $x = (X_1, \dots, X_n)$ so that $X_j$ are i.i.d. ($P$), we would like $T_n$ to be a good
approximation to $P$ with probability $\to 1$ as $n \to \infty$. The goodness of the
approximation will be measured by the loss function $L(T_n, P) := \|T_n -
P\|_{\mathcal C}$. From it we get the risk $r(T_n, P, \mathcal C) := E_P L(T_n, P)^*$ where $E_P$ denotes
expectation with respect to $P^n$. For any class $\mathcal Q$ of laws, let $r(T_n, \mathcal Q, \mathcal C) :=
\sup\{r(T_n, P, \mathcal C) : P \in \mathcal Q\}$, and let $r_n(\mathcal Q, \mathcal C)$ be the minimax risk, i.e. the infimum
of $r(T_n, \mathcal Q, \mathcal C)$ over all $\mathcal C$-estimators $T_n$.
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In finding minimax risks for C-estimators we can assume T, takes values 
in [0, 1] since max(0, min(T,, 1)) will clearly have risks no larger than those 
of T,. 

For any $\mathcal{C}$ and $\mathcal{Q}$, clearly $0 \le r_n(\mathcal{Q}, \mathcal{C}) \le 1$ and $r_n$ is nonincreasing in $n$.
The following theorem holds for C-estimators and so a fortiori for estimators, 
whose values are probability measures. The following fact is mainly due to P. 
Assouad (see the Notes to this section). 


Theorem 6.26 Let $\mathcal{P}$ be the class of all probability measures on a sample
space $(X, \mathcal{B})$ and $\mathcal{C} \subset \mathcal{B}$. If the minimax risk $r_n(\mathcal{P}, \mathcal{C}) < 1/2$ for some $n$, then
$\mathcal{C}$ is a Vapnik–Červonenkis class.


Proof. Suppose $S(\mathcal{C}) = +\infty$. Given $n$, for any $m > n$ let $F$ be a set with $2m$
elements, shattered by $\mathcal{C}$. Let $\mathcal{D} \subset \mathcal{C}$ be a collection of $2^{2m}$ sets which shatters $F$.
Then $\|\cdot\|_{\mathcal{C}} \ge \|\cdot\|_{\mathcal{D}}$, and since $\mathcal{D}$ is finite, $\|T_n - P\|_{\mathcal{D}}$ will be measurable
for any $\mathcal{C}$-estimator $T_n$ and any $P$. Let $T_n$ be a $\mathcal{C}$-estimator defined on $X^n$. Let $\pi$
be a ("prior") probability distribution on the set (finite-dimensional simplex) of
all probability laws on $F$, with mass $1/\binom{2m}{m}$ at each law uniformly distributed
over an $m$-element subset of $F$. Then the maximum risk is bounded below by
the risk for $\pi$ and $\mathcal{D}$,

\[ \sup_P E_P \|T_n - P\|_{\mathcal{C}} \ge E_\pi E_P \|T_n - P\|_{\mathcal{D}}. \]


For each $n$-tuple $x = (x_1, \ldots, x_n) \in F^n$, let ran $x$ denote the range of $x$,
ran $x = \{x_1, \ldots, x_n\}$, and let $k = k(x)$ be the number of distinct $x_i$, which is the
cardinality of ran $x$.

Let $\mu$ be the distribution of $x$ in $F^n$ averaged with respect to $\pi$,
$\mu := \int P^n \, d\pi(P)$. Then for each $x \in F^n$, let $\pi(x) := \pi_x$ be the posterior distribution
defined by $\pi$ given $x$, namely, the law $\pi_x$ giving mass $1/\binom{2m-k}{m-k}$ to each law
uniform on a set of $m$ elements of $F$ including ran $x$. Then for any $\mathcal{C}$-estimator $T_n$,

\[ E_\pi E_P \|T_n - P\|_{\mathcal{D}} = E_\mu E_{\pi(x)} \|T_n - P\|_{\mathcal{D}}. \]


Let $\mathcal{A} := 2^F$. For each $A \in \mathcal{A}$, there is a unique $D_A \in \mathcal{D}$ with $D_A \cap F = A$.
For any $P$ in the support of $\pi$, $P(D_A) = P(A)$. Let $V_n(x, A) := T_n(x, D_A)$,
an $\mathcal{A}$-estimator. Then $\|T_n - P\|_{\mathcal{D}} \ge \|V_n - P\|_{\mathcal{A}}$, and we want to find a lower
bound for $E_{\pi(x)} \|V - P\|_{\mathcal{A}}$ for each $P$ and $x$ fixed where $V(A) := V_n(x, A)$.
Let $\gamma(k) := \pi_x$. If $P_0$ is any fixed law in the support of $\gamma(k)$, and if $\tau$ is
uniformly distributed over the group $G$ of all $(2m - k)!$ permutations of $F$
which equal the identity on ran $x$ (and thus permute the other $2m - k$ elements
of $F$), then $P_0 \circ \tau^{-1}$ will have distribution $\gamma(k)$. Let $\tau[A] := \{\tau(y) : y \in A\}$
for any set $A$. Each permutation $\tau$ defines a 1–1 transformation $A \mapsto \tau[A]$ of
$\mathcal{A}$ onto itself. Let $(V \circ \tau)(A) := V(\tau[A])$. Then we have

\[ E_{\gamma(k)} \|V - P\|_{\mathcal{A}} = \frac{1}{(2m-k)!} \sum_{\tau \in G} \|V - P_0 \circ \tau^{-1}\|_{\mathcal{A}} = \frac{1}{(2m-k)!} \sum_{\tau \in G} \|V \circ \tau - P_0\|_{\mathcal{A}}, \]

which by the triangle inequality is bounded below by

\[ \Big\| (2m-k)!^{-1} \sum_{\tau \in G} V \circ \tau - P_0 \Big\|_{\mathcal{A}} =: \|\bar{V} - P_0\|_{\mathcal{A}}. \]
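Now let $\mathcal{A}(m) := \{A \in \mathcal{A} : \text{ran } x \subset A \text{ and } |A| = m\}$. For any $A, B \in \mathcal{A}(m)$
there is a $\tau \in G$ with $\tau[A] = B$. It follows that $\bar{V}$, restricted to $\mathcal{A}(m)$, is a
constant. The support of $P_0$ is in $\mathcal{A}(m)$. For another set $C \in \mathcal{A}(m)$, $P_0(C) = k/m$. Thus

\[ \|\bar{V} - P_0\|_{\mathcal{A}(m)} \ge \max\Big(1 - \bar{V},\ \bar{V} - \frac{k}{m}\Big) \ge \frac{1}{2}\Big(1 - \frac{k}{m}\Big) \ge \frac{1}{2} - \frac{n}{2m}. \]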


This last bound, uniform in k, holds after integration with respect to m. 
For a given $n$, letting $m \to \infty$, we see that the minimax risk for $\|\cdot\|_{\mathcal{C}}$ is at
least 1/2.


A corollary of Theorem 6.26, taking $T_n$ as the empirical measure $P_n$, is:


Theorem 6.27 (P. Assouad) If $(X, \mathcal{B})$ is a measurable space and $\mathcal{C} \subset \mathcal{B}$ is a
uniform Glivenko–Cantelli class of sets, that is, $\sup_P E_P \|P_n - P\|_{\mathcal{C}} \to 0$ as
$n \to \infty$, where the supremum is over all probability laws on $(X, \mathcal{B})$, then $\mathcal{C}$ is
a Vapnik–Červonenkis class.
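The mechanism behind Theorems 6.26 and 6.27 can be checked numerically in the special case $T_n = P_n$. The sketch below is an illustration only, not part of the proof: it assumes a finite set $F$ of $2m$ points on which every subset is available (as when $\mathcal{C}$ shatters $F$), takes $P$ uniform on an $m$-element subset, and computes the supremum over all subsets of $|P_n(A) - P(A)|$. The function names and the parameters `n`, `m`, `seed` are hypothetical. Taking $A$ to be the unobserved part of the support shows the supremum is at least $1 - k/m$, where $k$ is the number of distinct observations.

```python
import random
from collections import Counter


def sup_deviation_all_subsets(sample, support):
    """sup over all subsets A of |P_n(A) - P(A)| for P uniform on `support`.

    Over the class of all subsets of a finite set, this supremum equals
    half the l1 distance between the two probability vectors.
    """
    n, m = len(sample), len(support)
    counts = Counter(sample)
    points = set(support) | set(counts)
    l1 = sum(
        abs(counts.get(x, 0) / n - (1.0 / m if x in support else 0.0))
        for x in points
    )
    return l1 / 2


def demo(n=20, m=200, seed=0):
    rng = random.Random(seed)
    # F = {0, ..., 2m - 1}; P is uniform on a random m-element subset of F.
    support = set(rng.sample(range(2 * m), m))
    support_list = sorted(support)
    sample = [rng.choice(support_list) for _ in range(n)]
    k = len(set(sample))  # number of distinct observations
    return sup_deviation_all_subsets(sample, support), 1 - k / m


dev, lower = demo()
print(dev, lower)
```

With $n$ fixed and $m$ large the lower bound $1 - k/m \ge 1 - n/m$ is close to 1, which is why no uniform law of large numbers can hold over a class that shatters arbitrarily large sets.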


Now it will be shown that the constant 1/2 in Theorem 6.26 is sharp: 


Proposition 6.28 For the class $\mathcal{P}$ of all probability measures on a sample
space $(X, \mathcal{B})$ there is always a $\mathcal{B}$-estimator $T$, not depending on $n$ or $x$, with
$r(T, \mathcal{P}, \mathcal{B}) \le 1/2$, so that for any $\mathcal{Q} \subset \mathcal{P}$, $\mathcal{C} \subset \mathcal{B}$, and $n$ we have
$r_n(\mathcal{Q}, \mathcal{C}) \le 1/2$. Moreover when $(X, \mathcal{B})$ is the unit interval $[0, 1]$ with the Borel $\sigma$-algebra,
there is a class $\mathcal{C}$ which is not a Vapnik–Červonenkis class, and for which $T$ on
$\mathcal{C}$ is given by a probability measure, so $T$ is an estimator, not only a $\mathcal{C}$-estimator.


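Proof. A $\mathcal{C}$-estimator $T$ (not depending on $n$ or $x$) is defined by $T(x, A) = 1/2$
for all $A \in \mathcal{B}$. Then $r(T, \mathcal{Q}, \mathcal{C}) \le 1/2$ for any $\mathcal{Q}$ and $\mathcal{C}$, so $r_n(\mathcal{Q}, \mathcal{C}) \le 1/2$ for
all $n$.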

For Lebesgue measure $\lambda$ on $[0, 1]$ let $\mathcal{C} := \{C_k\}_{k \ge 1}$ be an infinite sequence
of sets independent with probability 1/2; e.g., $C_k$ is the set where the $k$th binary
digit is 1. Recall that sets $C_i$, $i = 1, \ldots, k$, are called "independent in $X$" if
for any set $J \subset \{1, \ldots, k\}$, the intersection of all the $C_i$ for $i \in J$ and of the
complements of the $C_i$ for $i \notin J$ is nonempty. In other words, the (Boolean)
algebra generated by the $C_i$ has the maximum number, $2^k$, of possible atoms.
Clearly the $C_i$ are independent in $[0, 1]$. For any $m = 1, 2, \ldots$, a collection of $2^m$
sets independent in a set $X$ shatters some subset with $m$ elements by Theorem
4.49. So $\mathcal{C}$ is not a Vapnik–Červonenkis class.
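The claim that the binary-digit sets are independent in $[0, 1]$ can be checked directly for small $k$. The following sketch is only an illustration under the assumption that $C_i$ is the set of $x \in [0, 1)$ whose $i$th binary digit is 1; the helper names are hypothetical. For each of the $2^k$ patterns of membership it exhibits a dyadic midpoint lying in the corresponding atom.

```python
from itertools import product


def kth_binary_digit(x, k):
    """k-th binary digit (k = 1, 2, ...) of x in [0, 1)."""
    return int(x * 2 ** k) % 2


def atoms_nonempty(k):
    """Check that all 2**k membership patterns of C_1, ..., C_k occur.

    For each pattern we test the dyadic midpoint whose first k binary
    digits spell out the pattern; such points are exact in floating
    point for moderate k.
    """
    for pattern in product((0, 1), repeat=k):
        x = sum(d / 2 ** (i + 1) for i, d in enumerate(pattern)) + 1 / 2 ** (k + 1)
        digits = tuple(kth_binary_digit(x, i + 1) for i in range(k))
        if digits != pattern:
            return False
    return True


print(atoms_nonempty(6))
```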


Next, it will be shown that one of the hypotheses of Theorem 6.7, existence
of an integrable envelope function, is essentially necessary for a law of large
numbers (Glivenko–Cantelli property). This is related to the fact that for
i.i.d. real random variables $Y_1, \ldots, Y_n, \ldots$, $(Y_1 + \cdots + Y_n)/n$ converges a.s. to a
finite limit if and only if $E|Y_1| < \infty$ (RAP, Theorem 8.3.5).


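Theorem 6.29 If $(S, \mathcal{B}, P)$ is a probability space, $\mathcal{F} \subset \mathcal{L}^1(S, \mathcal{B}, P)$, $\|P\|_{\mathcal{F}} < \infty$,
and if $\mathcal{F}$ is a strong Glivenko–Cantelli class for $P$, i.e., $\|P_n - P\|_{\mathcal{F}}^* \to 0$
a.s., then $\mathcal{F}$ has an integrable envelope function: $E\|P_1\|_{\mathcal{F}}^* < \infty$.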


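Proof. We have as $n \to \infty$, $\|P_{n-1} - P\|_{\mathcal{F}}^* \to 0$ a.s., $\|\frac{n-1}{n}(P_{n-1} - P)\|_{\mathcal{F}}^* \to 0$
a.s. and $\|P/n\|_{\mathcal{F}} \to 0$, so, since $\delta_{X_n}/n = (P_n - P) - \frac{n-1}{n}(P_{n-1} - P) + P/n$,
$\|\delta_{X_n}/n\|_{\mathcal{F}}^* \to 0$ a.s. Let $F(x) := \|\delta_x\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |f(x)|$ for $x \in S$.
Let $S_j$ be copies of $S$ whose Cartesian product $S^\infty$
is taken in the standard model. For each $n$ write $S^\infty = S_n \times \prod_{j \ne n} S_j$. Then
Lemma 3.6 implies that $\|\delta_{X_n}/n\|_{\mathcal{F}}^*$, where $\|\cdot\|^*$ is with respect to $P^\infty$ on $S^\infty$,
equals $F^*(X_n)/n$. The random variables $F^*(X_n)$ are i.i.d. Thus by the Borel–Cantelli
Lemma $\sum_n P^\infty(F^*(X_n) > n) < \infty$, so $\sum_n P(F^* > n) < \infty$,
which implies $E F^*(X_1) < \infty$ (RAP, Lemma 8.3.6).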


Problems 


1. On $[0, 1]$ with Lebesgue (uniform) law $P = U[0, 1]$, for any $\alpha > 0$ let
$\mathcal{F}(\alpha) := \{k^{\alpha} 1_{[0, 1/k]} : k = 1, 2, \ldots\}$.


(a) Show that for $\alpha < 1$, $\|P_n - P\|_{\mathcal{F}(\alpha)}$ is measurable and $\to 0$ a.s. as $n \to \infty$.
Hint: This follows from the Chebyshev inequality only for some values of $\alpha$.
Instead, find the envelope function F of F(a) and show that it is integrable 
for P. Use the VC subgraph property. All subgraphs, by definition, include 
the set X x {0}, in this case the x axis. Except for that, the subgraphs are 
rectangles with sides parallel to the axes and lower left vertex at the origin. 
Show that the subgraphs form a VC class of sets. (There is a relation between 
width and height of the rectangles, but that need not be used in showing the VC 
property.) 

(b) Show that for $\alpha \ge 1$, $\|P_n - P\|_{\mathcal{F}(\alpha)}$ does not approach 0 in probability.
(Prove this directly, without using facts from Chapter 6. Let $P_n$ be based on
$X_1, \ldots, X_n$ i.i.d. $U[0, 1]$ of which the smallest is $X_{(1)} > 0$. What happens when
$k > 1/X_{(1)}$?)


2. If $\mathcal{C}$ is the collection of all intersections of four closed half-spaces in $\mathbb{R}^3$
(which includes all tetrahedra), and $P$ is any law on the Borel sets of $\mathbb{R}^3$, show
that $\|P_n - P\|_{\mathcal{C}}$ is measurable and goes to 0 a.s. as $n \to \infty$. Hint: You can
assume the result of Chapter 5, Problem 8. (Any polyhedron with 4 faces in $\mathbb{R}^3$
is convex and is an intersection of four half-spaces.)


Let $\mathrm{Co}(\mathbb{R}^d)$ be the class of all closed convex sets in $\mathbb{R}^d$.


3. Let $\mathcal{C} := \mathrm{Co}(\mathbb{R}^2)$ and $F := \{(0,0), (1,1), (2,4), (3,1), (4,0)\}$. Evaluate
$\Delta^{\mathcal{C}}(F)$ and $k^{\mathcal{C}}(F)$. Hint: The subsets $G \subset F$ not in $\mathcal{C} \sqcap F$ are those whose
convex hull contains a point of $F$ not in $G$. What are these sets?
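The hint in Problem 3 describes a finite check that can be carried out by brute force, which may be useful for verifying a hand computation. The sketch below is an assumption-laden illustration, not the intended solution method: it treats a subset $G \subset F$ as a trace $C \cap F$ of a closed convex set exactly when $\mathrm{conv}(G) \cap F = G$, and tests hull membership through Carathéodory's theorem in the plane using integer cross products. All function names are hypothetical.

```python
from itertools import combinations, combinations_with_replacement


def cross(o, a, b):
    """Twice the signed area of the triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_closed_triangle(p, a, b, c):
    """True if p is in the convex hull of a, b, c (possibly degenerate)."""
    if cross(a, b, c) != 0:
        d = (cross(a, b, p), cross(b, c, p), cross(c, a, p))
        return all(v >= 0 for v in d) or all(v <= 0 for v in d)
    # Degenerate case: the hull of a, b, c is a segment or a single point.
    if cross(a, b, p) != 0 or cross(a, c, p) != 0 or cross(b, c, p) != 0:
        return False
    xs, ys = (a[0], b[0], c[0]), (a[1], b[1], c[1])
    return min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)


def in_hull(p, pts):
    """Caratheodory in the plane: p is in conv(pts) iff it lies in the
    hull of at most three of the points (repetitions allowed)."""
    return any(
        in_closed_triangle(p, a, b, c)
        for a, b, c in combinations_with_replacement(pts, 3)
    )


def convex_traces(F):
    """All G in F with conv(G) meeting F exactly in G."""
    traces = []
    for size in range(len(F) + 1):
        for G in combinations(F, size):
            if not any(in_hull(p, G) for p in F if p not in G):
                traces.append(G)
    return traces


F = [(0, 0), (1, 1), (2, 4), (3, 1), (4, 0)]
print(len(convex_traces(F)))
```

Listing the subsets that are missing from `convex_traces(F)` gives the sets asked for in the hint.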


4. For $\mathcal{C} = \mathrm{Co}(\mathbb{R}^d)$ show that for any law $P$ on $\mathbb{R}^d$ (on the Borel $\sigma$-algebra, as
usual) $\sup_{A \in \mathcal{C}} (P_n - P)(A)$ is measurable.


5. Let $\mathcal{C}$ be the collection of all unions of three intervals in $\mathbb{R}$. Show that $\Delta^{\mathcal{C}}(F)$
and $k^{\mathcal{C}}(F)$ depend only on the cardinality $m = |F|$ and evaluate each of them
as a function of $m = 1, 2, \ldots$.


6. Let F := {tlo : 0 <t < 1}. Show that F is a Donsker class for 
P = U(0, 1]. Hint: What is the envelope function F of F? As in Problem 1, 
use the VC subgraph property. Apply Corollary 6.19. 


7. For a law P on R suppose F € £°(R, P). Let G be the class of functions 
$g$ on $\mathbb{R}$ with total variation < 1 and $\|g\|_{\sup}$ < 1. Show that $\{Fg : g \in \mathcal{G}\}$ is a
Donsker class for P. Hints: Recall that a function of bounded variation is a 
difference of two bounded nondecreasing functions. You can assume the results 
of Problem 6 of Chapter 3. Look also at Theorems 3.36 and 4.51(b). The class 
of nondecreasing functions with a given bound (such as 1 or 2) is not a VC 
subgraph class (problem 4.10), but it is a VC major class. Apply Theorem 6.21. 


8. Show that if $X$ is countable and $\mathcal{A} = 2^X$ then $\mathcal{A}$ is a universal Glivenko–Cantelli
class (i.e., prove the first statement in Remark 6.25). Hint: Given any
$P$ on $X$ and $\varepsilon > 0$, show there is a finite set $F$ with $P(F) > 1 - \varepsilon/k$ for a
given $k$ (such as 2 or 4). For each $x \in F$ and $n$ large enough, $P_n(\{x\})$ is as close
as we want to $P(\{x\})$. Use this to show that $P_n(X \setminus F)$ becomes less than some
multiple of $\varepsilon$.


9. Let $P(\{k\}) := p_k := c k^{-5/4}$ for $k = 1, 2, \ldots$, with $c$ such that $\sum_{k=1}^{\infty} p_k = 1$.
Let $\mathbb{N}^+$ be the set of all positive integers and $\mathcal{C}$ the class of all subsets of $\mathbb{N}^+$.
Show that for some $a > 0$, $\|P_n - P\|_{\mathcal{C}} \ge a n^{-1/4}$ for all possible values of $P_n$,
and so, $\mathcal{C}$ is not Donsker for $P$. Hint: For each $n$ and possible value of $P_n$ there
is a set $A_n \subset \{1, 2, \ldots, 2n\}$ with at least $n$ members and $P_n(A_n) = 0$. Find a
lower bound on the possible values of $P(A_n)$. A sum can be bounded below by
an integral.


10. In Problem 9, if px are replaced by c/k* for the appropriate constant c, C 
is a Donsker class for P, as will follow easily from Theorem 7.9, but does this 
follow from Theorem 6.15? Hints: What is the envelope function F? Consider 
Proposition 6.2. Does it help to consider larger envelope functions G > F? 


11. Let X = H bea separable Hilbert space with an orthonormal basis {e iat 
Let k; := 1/(1 + log J) and P(kje;) := P(—kje;) := 3/2 j» for j = 
1,2,.... Then P is a law on H (Chapter 2, Problem 19). Let F := {yE H: 
lly|l < 1} with y(x) := (x, y). Show that F is a weak Glivenko—Cantelli class, 
but not a Glivenko—Cantelli class. Hint: F is not order-bounded. 


12. Let F be a universal Glivenko—Cantelli class. 


(a) If $f \in \mathcal{F}$, show that $f$ is bounded. Hint: Otherwise, it is not in $\mathcal{L}^1(P)$ for
some law $P$.


(b) Show that $\mathcal{F}_0 := \{f - \inf f : f \in \mathcal{F}\}$ is uniformly bounded.


13. Let $(X, \mathcal{A})$ be a measurable space and $X = \bigcup_j A_j$ for some disjoint
sets $A_j \in \mathcal{A}$. Let $\mathcal{C} \subset \mathcal{A}$ be image-admissible Suslin and such that for each
$j$, $S(\mathcal{C} \sqcap A_j) < \infty$ (but may grow arbitrarily fast with $j$). Show that $\mathcal{C}$ is a
universal Glivenko–Cantelli class.


Notes 


Notes to Section 6.1. Koltchinskii (1981) defined D6, F,y) when F = 1 
and y = P,. Pollard (1982) independently defined it for general F, for p = 2, 
and for distinct x( j). Both, in fact, used the minimal cardinality of an e-net (see 
Section 1.2 above). Theorem 6.1 is due to Pollard (1982, proof of Theorem 
9). I do not know references for Propositions 6.2 or 6.3. The symmetrization 
method is also due independently to Koltchinskii (1981) and Pollard (1982). 
Lemma 6.5 is adapted from Pollard (1982, Lemma 11), who proves it for a 
countable class F, thus avoiding measurability difficulties. From the countable 
case one can infer the result if there is a countable subset $\mathcal{H} \subset \mathcal{F}$ such that
$\|\nu_n\|_{\mathcal{H}} = \|\nu_n\|_{\mathcal{F}}$ almost surely. Thus it suffices for $\nu_n$ on $\mathcal{F}$ to be separable in
the sense of Doob. For most classes $\mathcal{F}$ arising in applications the empirical
process is naturally separable. But the collection $\mathcal{C}$ of all singletons in $[0, 1]$,
for $P$ = Lebesgue measure, appears as a rather regular collection for which
the empirical process is not separable. In such a case it may appear unnatural
to use a separable modification of the process, so that, e.g., $P_1(\{x\}) = 0$ for all
$x$, even $x = X_1$! Strobl (1995) proved Theorem 6.6(b) without supplementary
measurability assumptions. 
Theorem 6.7 extends Pollard (1982, Theorem 12), and Wolfowitz (1954)
for half-spaces in $\mathbb{R}^d$.

Vapnik and Červonenkis, with their different measurability assumption,
announced Corollary 6.8 in 1968, in the first publication to define the classes 
now called VC classes. This was the main theorem they announced. In 1971 
they published a longer paper giving proofs. 


Notes to Section 6.2. In Theorem 6.10, Vapnik and Červonenkis (1971, Theorem 4)
proved equivalence of (b) and (c). The proof here was first given in
Dudley (1984). The Koltchinskii–Pollard techniques allowed some simplification
of the proof of Vapnik and Červonenkis. Steele (1978b) proved equivalence
of (a) and (b) using Kingman's theorem and proved Theorem 6.13 and (6.9).
However, Steele's assumption that $\|P_n - P\|_{\mathcal{C}}$ is measurable has been strengthened
to "image admissible Suslin." Some such strengthening is needed in the
proof.


Notes to Section 6.3. Pollard (1982) proved Theorem 6.15 when the empirical
processes $\nu_n$ are stochastically separable in the sense of Doob, as is usually
true ab initio in cases of interest and can always be obtained by modifications 
of the process, which may, occasionally, appear unnatural (see the Notes to 
Section 6.1). The “image admissible Suslin” formulation was given in Dudley 
(1984). The proof, based on Pollard’s, is somewhat different, not only as regards 
measurability, but notably in that Proposition 6.18 extends a more specific result 
of Pollard. 

The implication from Pollard’s theorem 6.15 to Theorem 6.16, of Jain and 
Marcus (1975), was first given in Dudley (1984). 


Notes to Section 6.4. Theorem 6.24 and the example before Example 6.25 are 
from Durst and Dudley (1981). Assouad (1982, Proposition C) proved a form 
of Theorem 6.26 with 1/(8e) in place of 1/2, and so proved Theorem 6.27. 
I thank David Pollard for a remark leading to Proposition 6.28. Theorem 6.26 
and Proposition 6.28 were given in Assouad and Dudley (1990), which, like 
Assouad (1982), was not published. Assouad (1985) is a related work. 




7 


Metric Entropy, with Inclusion and Bracketing 


7.1 Definitions and the Blum-DeHardt Law of Large Numbers 


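Definitions. Given a measurable space $(A, \mathcal{A})$, recall that $\mathcal{L}^0(A, \mathcal{A})$ is the set of
all real-valued $\mathcal{A}$-measurable functions on $A$. Given $f, g \in \mathcal{L}^0(A, \mathcal{A})$ with $f \le g$,
i.e. $f(x) \le g(x)$ for all $x \in A$, let $[f, g] := \{h \in \mathcal{L}^0(A, \mathcal{A}) : f \le h \le g\}$.
A set $[f, g]$ will be called a bracket. Given a probability space $(A, \mathcal{A}, P)$,
$1 \le q \le \infty$, $\mathcal{F} \subset \mathcal{L}^q(A, \mathcal{A}, P)$ with usual seminorm $\|\cdot\|_q$, and $\varepsilon > 0$, let
$N_{[\,]}^{(q)}(\varepsilon, \mathcal{F}, P)$ denote the smallest $m$ such that for some $f_1, \ldots, f_m$ and
$g_1, \ldots, g_m$ in $\mathcal{L}^q(A, \mathcal{A}, P)$, with $\|g_i - f_i\|_q \le \varepsilon$ for $i = 1, \ldots, m$,

\[ \mathcal{F} \subset \bigcup_{i=1}^{m} [f_i, g_i]. \tag{7.1} \]

Here $\log N_{[\,]}^{(q)}(\varepsilon, \mathcal{F}, P)$ will be called a metric entropy with bracketing.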


Note that the $f_i$ and $g_i$ are not required to be in $\mathcal{F}$. For example, if $\mathcal{F}$ is the
set of indicators of half-planes in $\mathbb{R}^2$, then $f \le h \le g$ for $f, g, h$ in $\mathcal{F}$ would
require the boundary lines of all three half-planes to be parallel. If instead we
let $f$ be the indicator of an intersection of two half-planes and $g$ that of a union,
then there can be a nondegenerate set of $h \in \mathcal{F}$ with $f \le h \le g$.

Also note that an individual bracket $[f, g]$ has $\max(-f, g) = \max(|f|, |g|)$
as envelope function, and so if (7.1) holds, for some $\varepsilon$, then $\mathcal{F}$ has an envelope
function $\max_{1 \le j \le m} \max(-f_j, g_j) \in \mathcal{L}^q(A, \mathcal{A}, P)$. So, in this chapter, unlike
the last, separate assumptions about envelope functions are not needed.

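If $\mathcal{F} \subset \mathcal{L}^r$, then for $q \le r \le \infty$, $\mathcal{F} \subset \mathcal{L}^q$ and

\[ N_{[\,]}^{(q)}(\varepsilon, \mathcal{F}, P) \le N_{[\,]}^{(r)}(\varepsilon, \mathcal{F}, P) \quad \text{for all } \varepsilon > 0. \tag{7.2} \]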


Recall that for a bounded real-valued function $f$, $\|f\|_{\sup} := \sup_x |f(x)|$. Let
$d_{\sup}(f, g) := \|f - g\|_{\sup}$. It is easily seen using brackets $[f_i - \varepsilon, f_i + \varepsilon]$ that
for any law $P$,

\[ N_{[\,]}^{(\infty)}(2\varepsilon, \mathcal{F}, P) \le D(\varepsilon, \mathcal{F}, d_{\sup}). \tag{7.3} \]

Thus, for example, a set $\mathcal{F}$ of continuous functions, totally bounded for $d_{\sup}$ with
given bounds $D(\varepsilon, \mathcal{F}, d_{\sup})$, will have the same bounds on all $N_{[\,]}^{(q)}(2\varepsilon, \mathcal{F}, P)$,
$1 \le q \le \infty$.

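If $\mathcal{F}$ consists of indicator functions of measurable sets, then in finding
brackets $[f_i, g_i]$ to cover $\mathcal{F}$, it is no loss to assume $0 \le f_i \le g_i \le 1$ for all
$i$. Next, if $C(i) := \{x : f_i(x) > 0\}$, $D(i) := \{x : g_i(x) = 1\}$, and $f_i \le 1_C \le g_i$
then

\[ f_i \le 1_{C(i)} \le 1_C \le 1_{D(i)} \le g_i. \]

So, $[f_i, g_i]$ can be replaced by $[1_{C(i)}, 1_{D(i)}]$. If $\mathcal{C}$ is a collection of measurable
sets and $\varepsilon > 0$, let

\[ N_I(\varepsilon, \mathcal{C}, P) := \inf\{m : \text{for some } C_1, \ldots, C_m \text{ and } D_1, \ldots, D_m \text{ in } \mathcal{A}, \text{ for all } C \in \mathcal{C} \text{ there is an } i \text{ with } C_i \subset C \subset D_i \text{ and } P(D_i \setminus C_i) \le \varepsilon\}. \]

Here the $I$ in $N_I$ indicates "inclusion." Then it follows that

\[ N_I(\varepsilon, \mathcal{C}, P) = N_{[\,]}^{(1)}(\varepsilon, \mathcal{F}, P) \quad \text{where } \mathcal{F} = \{1_C : C \in \mathcal{C}\}. \tag{7.4} \]

We have the following law of large numbers: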


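Theorem 7.1 (Blum–DeHardt) Suppose $\mathcal{F} \subset \mathcal{L}^1(A, \mathcal{A}, P)$ and for all $\varepsilon > 0$,
$N_{[\,]}^{(1)}(\varepsilon, \mathcal{F}, P) < \infty$. Then $\mathcal{F}$ is a strong Glivenko–Cantelli class, that is,

\[ \lim_{n \to \infty} \|P_n - P\|_{\mathcal{F}}^* = 0 \quad \text{a.s.} \]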


Proof. Given $\varepsilon > 0$ take $f_1, \ldots, f_m$ and $g_1, \ldots, g_m$ in $\mathcal{L}^1$, $m < \infty$, to satisfy
(7.1) for $q = 1$. Let $f_{m+j} := g_j$ for $j = 1, \ldots, m$. By the ordinary strong law
of large numbers there is an $N$ such that

\[ \Pr\Big\{ \sup_{n \ge N} \max_{j \le 2m} |(P_n - P)(f_j)| > \varepsilon \Big\} < \varepsilon. \]

For each $f \in \mathcal{F}$ let $f_i \le f \le g_i$ with $P(g_i - f_i) \le \varepsilon$. Then if
$\max_{j \le 2m} |(P_n - P)(f_j)| \le \varepsilon$, we have

\[ |(P_n - P)(f)| \le |(P_n - P)(f_i)| + |(P_n - P)(f - f_i)| \le \varepsilon + (P_n + P)(g_i - f_i) \le 3\varepsilon + (P_n - P)(g_i - f_i) \le 5\varepsilon. \]

Thus

\[ \Pr^*\Big\{ \sup_{n \ge N} \|P_n - P\|_{\mathcal{F}} > 5\varepsilon \Big\} < \varepsilon. \]

One can then apply Lemma 3.8, saying that for any real-valued function $X$
and real $t$, $\Pr^*(X > t) = \Pr(X^* > t)$, and the easily checked fact that for any
sequence $\psi_n$ of real-valued functions, $(\sup_n \psi_n)^* = \sup_n \psi_n^* \le +\infty$ almost
surely.
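The bracketing step of the proof can be seen concretely in one simple case. The sketch below assumes, purely for illustration, the class of indicators $1_{[0,t]}$, $0 \le t \le 1$, under $P = U[0,1]$, with brackets whose endpoints lie on a grid of spacing at most $\varepsilon$; the function names are hypothetical. For this class the argument gives the slightly sharper bound that the supremum of $|(P_n - P)(f)|$ over the class is at most the maximum over bracket endpoints plus $\varepsilon$, and the code compares the two sides on a simulated sample.

```python
import math
import random
from bisect import bisect_right


def exact_sup(sorted_sample):
    """sup over t in [0, 1] of |F_n(t) - t| for the uniform law."""
    n = len(sorted_sample)
    return max(
        max((i + 1) / n - x, x - i / n) for i, x in enumerate(sorted_sample)
    )


def bracket_bound(sorted_sample, eps):
    """Max of |F_n(t_j) - t_j| over grid endpoints t_j, plus eps.

    The grid has spacing 1 / ceil(1 / eps) <= eps, so each indicator
    1_[0,t] sits in a bracket [1_[0,t_{j-1}], 1_[0,t_j]] of L^1 size <= eps.
    """
    n = len(sorted_sample)
    m = math.ceil(1 / eps)
    grid = [j / m for j in range(m + 1)]
    worst = max(abs(bisect_right(sorted_sample, t) / n - t) for t in grid)
    return worst + eps


rng = random.Random(1)
sample = sorted(rng.random() for _ in range(500))
print(exact_sup(sample), bracket_bound(sample, 0.05))
```

Only finitely many functions (the bracket endpoints) need the ordinary law of large numbers; the rest of the class is controlled deterministically, which is the reason no measurability assumption enters.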


Remark. No measurability assumption such as image admissible Suslin was 
needed in Theorem 7.1, as it was not needed in the proof. We used the law 
of large numbers over a finite set of $f_i$, $g_i$, then we used bracketing to control
$(P_n - P)(f)$ for $f_i \le f \le g_i$. We did have a star in the statement, which is
not needed if F is image-admissible Suslin. For similar reasons, measurability 
assumptions are not needed in the statement or proof of bracketing central limit 
theorems such as Theorem 7.6. 


The sufficient condition in Theorem 7.1 is not necessary, as the next Proposition
will show. For a class $\mathcal{C}$ of sets, we saw in Remark 6.25 that the Glivenko–Cantelli
property holds for all $P$ for the family $\mathcal{C} = 2^{\mathbb{N}}$ of all subsets of $\mathbb{N}$, which
is not a VC class. Conditions on a class of sets equivalent to the Glivenko–Cantelli
property for a given $P$ were given in Section 6.2.


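Proposition 7.2 There is a probability space $(A, \mathcal{A}, P)$ and a strong Glivenko–Cantelli
class $\mathcal{F} := \{1_C : C \in \mathcal{C}\}$ for $P$, where $\mathcal{C} \subset \mathcal{A}$ is such that for all
$\varepsilon < 1/2$, we have $N_{[\,]}^{(1)}(\varepsilon, \mathcal{F}, P) = +\infty$.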


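Proof. Let $A = [0, 1]$ with $P = U[0, 1]$ the uniform (Lebesgue) law. Let
$C_m := C(m)$ be independent sets with $P(C_m) = 1/m$. Then to show

\[ \lim_{n \to \infty} \sup_m |(P_n - P)(C_m)| = 0 \quad \text{a.s.,} \tag{7.5} \]

let $0 < \varepsilon < 1$. For $m \le 3/\varepsilon$ we have by Bernstein's inequality (Theorem 1.11)

\[ \Pr(|(P_n - P)(C_m)| > \varepsilon) \le 2 \exp\left( - \frac{n \varepsilon^2}{\frac{2}{m} + \frac{2\varepsilon}{3}} \right) \le 2 \exp(-mn\varepsilon^2/4). \]

For $m > 3/\varepsilon$, inequality (1.7) gives

\[ \Pr(|(P_n - P)(C_m)| > \varepsilon) \le E(n\varepsilon, n, 1/m) \le (e/(m\varepsilon))^{n\varepsilon}. \]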
We have for r = 1 or 2 


DG 3G) ee) 
= (re) O] 


Thus $\sum_{n \ge 2/\varepsilon} \sum_{m \ge 1} \Pr(|(P_n - P)(C_m)| > \varepsilon) < \infty$, and by the Borel–Cantelli
Lemma, we get (7.5).


Now, given functions $f_1, \ldots, f_k$ and $g_1, \ldots, g_k$ in $\mathcal{L}^1$ with $P(g_i - f_i) < 1/2$
for each $i$ suppose

\[ \{1_{C(m)}\}_{m \ge 1} \subset \bigcup_{i=1}^{k} [f_i, g_i]. \]

We may assume $0 \le f_i \le g_i \le 1$ for all $i$. For each $i$ with $P(g_i - f_i) < 1/2$,
we have $\sum \{P(C_m) : f_i \le 1_{C(m)} \le g_i\} < +\infty$ since if the series
diverges, then for a subsequence $C_{m(r)}$ we have $\sum_r P(C_{m(r)}) = +\infty$ and for
$C := \bigcup_r C_{m(r)}$, we have by the Borel–Cantelli lemma $P(C) = 1$, so $f_i \le 1_C \le g_i$
implies $P(g_i) = 1$ and $P(f_i) \ge 1 - P(g_i - f_i) > 1 - \frac{1}{2} = \frac{1}{2} > 0$, but then
$f_i \le 1_{C(m)}$ for only finitely many $m$, a contradiction. Thus

\[ N_I(\varepsilon, \{C_m\}, P) = +\infty \quad \text{for every } \varepsilon < 1/2, \]

which finishes the proof.


On the other hand, let $\mathcal{C}$ be the collection of all finite subsets of $[0, 1]$
with Lebesgue law $P$. Then $\|P_n - P\|_{\mathcal{C}} \equiv 1 \not\to 0$ although $1_A = 0$ a.s. for
all $A \in \mathcal{C}$. This shows that in Theorem 7.1, $N_{[\,]}^{(1)} < \infty$ cannot be replaced by
$N(\varepsilon, \mathcal{F}, d_p) = 1$ for any $\mathcal{L}^p$ distance $d_p$.

A Banach space $(S, \|\cdot\|)$ has a dual space $(S', \|\cdot\|')$ of continuous linear
forms $f : S \to \mathbb{R}$ with $\|f\|' := \sup\{|f(x)| : x \in S, \|x\| \le 1\} < \infty$ (RAP,
Section 6.1). One way to apply Theorem 7.1 is via the following:


Proposition 7.3 Let $(S, \|\cdot\|)$ be a separable Banach space and $P$ a law on the
Borel sets of $S$ such that $\int \|x\| \, dP(x) < \infty$. Let $\mathcal{F}$ be the unit ball of the dual
space $S'$, $\mathcal{F} := \{f \in S' : \|f\|' \le 1\}$. Then for every $\varepsilon > 0$, $N_{[\,]}^{(1)}(\varepsilon, \mathcal{F}, P) < \infty$.


Proof. By Ulam's theorem (RAP, Theorem 7.1.4) take a compact $K \subset S$
such that $\int_{S \setminus K} \|x\| \, dP(x) < \varepsilon/4$. The elements of $\mathcal{F}$, restricted to $K$, form a
uniformly bounded, equicontinuous family, hence totally bounded for the sup
norm $\|\cdot\|_K$ on $K$ by the Arzelà–Ascoli theorem. Take $f_1, \ldots, f_m \in \mathcal{F}$, $m < \infty$,
such that for all $f \in \mathcal{F}$, $\|f - f_j\|_K < \varepsilon/4$ for some $j$. Let $g_j := f_j - \varepsilon/4$ on
$K$, $g_j(x) := -\|x\|$, $x \notin K$; $h_j := f_j + \varepsilon/4$ on $K$, $h_j(x) := \|x\|$, $x \notin K$, all for
$j = 1, \ldots, m$. Then for any $f \in \mathcal{F}$, if $\|f - f_j\|_K < \varepsilon/4$, then $g_j \le f \le h_j$
and $P(h_j - g_j) \le \varepsilon$, so $N_{[\,]}^{(1)}(\varepsilon, \mathcal{F}, P) \le m < \infty$.


Corollary 7.4 (Mourier) Let $(S, \|\cdot\|)$ be a separable Banach space, $P$ a law
on $S$ such that $\int \|x\| \, dP(x) < \infty$, and $X_1, X_2, \ldots$ i.i.d. $P$. Let
$S_n := X_1 + \cdots + X_n$. Then $S_n/n$ converges a.s. in $(S, \|\cdot\|)$ to some $x_0 \in S$.


Proof. By Theorem 7.1 and Proposition 7.3, $S_n/n$ is a Cauchy sequence a.s.
for $\|\cdot\|$, hence converges a.s. to some random variable $Y \in S$. For each $f \in \mathcal{F}$,
$f(Y) = P(f)$ a.s., i.e., $Y \in f^{-1}(P(f))$ a.s. Let $\{f_m\}_{m \ge 1} \subset \mathcal{F}$ be a countable
total set: if $f_m(x) = 0$ for all $m$, then $x = 0$. Such $f_m$ exist by the Hahn–Banach
theorem (RAP, Corollary 6.1.5). Let $D := \bigcap_m f_m^{-1}(P(f_m))$. Then $Y \in D$ a.s.,
so $D$ is nonempty. But if $y, z \in D$ then $\|y - z\| = \sup_m |f_m(y - z)| = 0$, so
$D = \{x_0\}$ for some $x_0$.


Direct proof. Given $\varepsilon > 0$, there is a Borel measurable function $g$ from $S$
into a finite subset of itself such that $P(\|x - g(x)\|) < \varepsilon$. To show this, let
$\{x_i\}_{i=1}^{\infty}$ be dense in $S$, with $x_0 := 0$. For $k = 1, 2, \ldots$, let $g_k(x) := x_i$ for the
smallest $i$ such that $\|x - x_i\| = \min_{r \le k} \|x - x_r\|$. Then $g_k$ is Borel measurable,
$\|x - g_k(x)\| \le \|x\|$ for all $k$ and $x$, and $\|x - g_k(x)\| \downarrow 0$ as $k \to \infty$ for all $x$,
so $P(\|x - g_k(x)\|) \to 0$ by dominated or monotone convergence. So choose
$k = k(\varepsilon)$ such that $P(\|x - g_k(x)\|) < \varepsilon$ and let $g := g_k$.

The strong law of large numbers holds for the finite-dimensional variables
$g(X_j)$, with some limit $x_\varepsilon$, and

\[ \Big\| n^{-1} \sum_{j=1}^{n} (X_j - g(X_j)) \Big\| \le n^{-1} \sum_{j=1}^{n} \|X_j - g(X_j)\| \to E\|X_1 - g(X_1)\| < \varepsilon \quad \text{as } n \to \infty \]

by the one-dimensional strong law. Thus

\[ \limsup_{n \to \infty} \|x_\varepsilon - S_n/n\| \le \varepsilon \quad \text{a.s.} \]

Letting $\varepsilon \downarrow 0$ through some sequence, we get $S_n/n$ converging a.s. to some
$x_0$.


Corollary 7.5 If $\mathcal{F} \subset \mathcal{L}^1(A, \mathcal{A}, P)$ and $\{\delta_x : x \in A\}$ is separable for $\|\cdot\|_{\mathcal{F}}$,
then $\|P_n - P\|_{\mathcal{F}} = \|P_n - P\|_{\mathcal{F}}^* \to 0$ a.s.


Proof. This follows from Corollary 7.4, since finite linear combinations of
$\delta_x$, $x \in A$, with rational coefficients, are dense in their completion for $\|\cdot\|_{\mathcal{F}}$, a
Banach space.
The proof of Proposition 7.3 and Corollary 7.4 together from Theorem 7.1 is
no shorter than the direct proof. On the other hand, if $\mathcal{F} = \{1_{[0,t]} : 0 \le t \le 1\}$
and $P$ is Lebesgue measure on $[0, 1]$, then Theorem 7.1 applies but Corollary
7.5 does not.
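To see why Theorem 7.1 applies to the indicators of the intervals $[0, t]$, one can write down a finite bracket cover explicitly. The following is a minimal sketch under the assumption that brackets are taken of the form $[1_{[0,a]}, 1_{[0,b]}]$ with endpoints on an equally spaced grid; the function names are hypothetical. It produces $\lceil 1/\varepsilon \rceil$ brackets of $\mathcal{L}^1$ size at most $\varepsilon$, so $N_{[\,]}^{(1)}(\varepsilon, \mathcal{F}, P)$ is at most that number.

```python
import math


def indicator_brackets(eps):
    """Endpoints (a, b) of brackets [1_[0,a], 1_[0,b]] with b - a <= eps.

    Under Lebesgue measure on [0, 1] the L^1 size of such a bracket is
    P(1_[0,b] - 1_[0,a]) = b - a.
    """
    m = math.ceil(1 / eps)
    return [(j / m, (j + 1) / m) for j in range(m)]


def covering_bracket(t, brackets):
    """Return (a, b) with a <= t <= b, so 1_[0,a] <= 1_[0,t] <= 1_[0,b]."""
    for a, b in brackets:
        if a <= t <= b:
            return a, b
    raise ValueError("t must lie in [0, 1]")


brackets = indicator_brackets(0.1)
print(len(brackets), covering_bracket(0.37, brackets))
```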


7.2 Central Limit Theorems with Bracketing 


In this section the bracketing will be in $\mathcal{L}^2$. A bracket $[f, h]$ will be called a
$\delta$-bracket if $(\int (h - f)^2 \, dP)^{1/2} \le \delta$. The following main theorem will be proved.
Then, Corollary 7.8 gives a hypothesis on $N_{[\,]}^{(1)}$ for uniformly bounded classes
of functions.


Theorem 7.6 (M. Ossiander) Let $(X, \mathcal{A}, P)$ be a probability space and let
$\mathcal{F} \subset \mathcal{L}^2(X, \mathcal{A}, P)$ be such that

\[ \int_0^1 \big( \log N_{[\,]}^{(2)}(x, \mathcal{F}, P) \big)^{1/2} \, dx < \infty. \]

Then $\mathcal{F}$ is a $P$-Donsker class.
Proof (Arcones and Giné). Throughout the proof, $P(f)$ or $E(f)$ will be used
interchangeably to mean $\int f \, dP$ if it is defined.

First, a lemma will help. Recall that an envelope $G$ for a class $\mathcal{G}$ of functions
is a measurable function such that $|g(x)| \le G(x)$ for all $g \in \mathcal{G}$ and all $x$.

Lemma 7.7 Let $(X, \mathcal{A}, P)$ be a probability space, and $\mathcal{G}$ a set of real-valued
measurable functions on $X$, with an envelope $G$. Let $B \in \mathcal{A}$ and $\delta > 0$. Suppose
$P(G 1_B) \le \delta/2$, and that for each $g \in \mathcal{G}$, $A(g)$ is a measurable set with $A(g) \subset B$.
Then

\[ \Pr^*\{\|(P_n - P)(g 1_{A(g)})\|_{\mathcal{G}} > 2\delta\} \le \Pr\{|(P_n - P)(G 1_B)| > \delta\}. \]

Proof. $|P(g 1_{A(g)})| \le P(G 1_B) \le \delta/2$ for all $g \in \mathcal{G}$, so

\[ \Pr^*\{\|(P_n - P)(g 1_{A(g)})\|_{\mathcal{G}} > 2\delta\} \le \Pr^*\{\|P_n(g 1_{A(g)})\|_{\mathcal{G}} > 3\delta/2\} \le \Pr\{P_n(G 1_B) > 3\delta/2\} \le \Pr\{|(P_n - P)(G 1_B)| > \delta\}. \]


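Now to prove the theorem, let $N_k := N_{[\,]}^{(2)}(2^{-k}, \mathcal{F}, P)$, $k = 1, 2, \ldots$. Let
$\gamma_k := (\log(k N_1 \cdots N_k))^{1/2}$. Then $\gamma_k$ is increasing in $k$. By the integral test,
$\sum_k (\log N_k)^{1/2} / 2^{k+1} < \infty$, so

\[ \sum_{k=1}^{\infty} \gamma_k / 2^k \le \sum_{k=1}^{\infty} 2^{-k} \Big[ (\log k)^{1/2} + \sum_{j=1}^{k} (\log N_j)^{1/2} \Big] \le \sum_{k=1}^{\infty} (\log k)^{1/2} / 2^k + \sum_{j=1}^{\infty} (\log N_j)^{1/2} \sum_{k=j}^{\infty} 2^{-k} < \infty. \]

Let $\beta_k := \sum_{j \ge k} \gamma_j / 2^j$. Then $\beta_k \to 0$ as $k \to \infty$. Let $S_{k,i} := [f_{k,i}, h_{k,i}]$,
$i = 1, \ldots, N_k$, be a set of $2^{-k}$-brackets covering $\mathcal{F}$. Let $T_{k,i} := S_{k,i} \setminus \bigcup_{s < i} S_{k,s}$, so
that the sets $T_{k,i}$ are disjoint and each is included in a $2^{-k}$-bracket. If
$s_j \in \{1, \ldots, N_j\}$ for $j = 1, \ldots, k$, let $A_{k, s_1, \ldots, s_k} := \bigcap_{j=1}^{k} T_{j, s_j}$. For each nonempty
set $A_{k,s}$, $s = (s_1, \ldots, s_k)$, choose $f_{k,s} \in A_{k,s}$. For each $f \in \mathcal{F}$ let $A_k(f) := A_{k,s}$
and $\pi_k f := f_{k,s}$ for the unique $s := s(f) := s(f, k)$ such that $f \in A_{k,s}$.


Let $\Delta_k f := \Delta_k(f) := \sup\{|g - h| : g, h \in A_k(f)\}$. Then for any $f \in \mathcal{F}$,

\[ \Delta_k(f) \text{ is nonincreasing in } k \text{ and } E \Delta_k(f)^2 \le 2^{-2k}. \tag{7.6} \]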


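Note that $\Delta_k f$ depends on $f$ only through $s(f)$. For $k \ge 1$, $n \ge 1$ and $f \in \mathcal{F}$
let

\[ B_k := B(k) := B(k, f, n) := \{x \in X : \Delta_k f > n^{1/2} / (2^{k+1} \gamma_{k+1})\}. \tag{7.7} \]

For any fixed $j$ and $n$ and $x \in X$ let

\[ \tau f := \tau_j(n, f, x) := \min\{k \ge j : x \in B(k, f, n)\}, \]

where $\min \emptyset := +\infty$. Then $\{\tau f = j\} = B(j, f, n)$,
$\{\tau f > j\} = \{\Delta_j f \le n^{1/2} / (2^{j+1} \gamma_{j+1})\}$, and for $k > j$,

\[ \{\tau f \ge k\} \subset X \setminus B_{k-1} \subset \{\Delta_k f \le \Delta_{k-1} f \le n^{1/2} / (2^k \gamma_k)\} \tag{7.8} \]

by (7.6), and $\{\tau f = k\} \subset B_k \setminus B_{k-1}$.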

For any $f, g \in \mathcal{F}$ let $\rho_{[\,]}(f, g) := 1/2^K$ for the largest $K$ such that for some
$s$, $f$ and $g$ are both in $A_{K,s}$, or $\rho_{[\,]}(f, g) = 0$ if this holds for arbitrarily large
$K$. Then $\mathcal{F}$ is totally bounded for $\rho_{[\,]}$. So by Theorem 3.34 it will be enough to
prove the asymptotic equicontinuity condition for $\rho_{[\,]}$, in other words that for
every $\alpha > 0$,

\[ \lim_{j \to \infty} \limsup_{n \to \infty} \Pr^*\{ n^{1/2} \|(P_n - P)(f - \pi_j f)\|_{\mathcal{F}} > \alpha \} = 0. \tag{7.9} \]
J>% n>% 
We can assume $0 < \alpha < 1$. Then for any positive integers $j < r$, $f - \pi_j f$ will
be decomposed as follows:

\[ f - \pi_j f = (f - \pi_j f) 1_{\tau f = j} + (f - \pi_r f) 1_{\tau f \ge r} + \sum_{k=j+1}^{r-1} (f - \pi_k f) 1_{\tau f = k} + \sum_{k=j+1}^{r} (\pi_k f - \pi_{k-1} f) 1_{\tau f \ge k}; \tag{7.10} \]

this is easily seen for $r = j + 1$ and then by induction on $r$. The decomposition
(7.10) will give a bound for the outer probability in (7.9) by a sum of four terms
to be labeled (I), (II), (III), and (IV) respectively below.

Let $\varepsilon := \alpha/8$. Fix $j = j(\varepsilon)$ large enough so that

\[ \beta_j < \varepsilon/24 \quad \text{and} \quad \sum_{k > j} k^{-2} < 2\varepsilon. \tag{7.11} \]


Then, choose r > j large enough so that, since y, increases with r, 
n! < e/4 and 2-exp(—y,2’ 12) < e. (7.12) 
Lemma 7.7 will be applied to classes of functions 
G = Gk s)=Gesi={f—mf: f EF, mf = frs} 


with envelope < G := Gks := Ax fis. 
About (I): for any function y > Oandt > 0, lys; < w/t. So by (7.7), 


n'PE(p Aj f) <2! yi ECA SY) 
<2 yj41 S46; < €/4 
for all f € F. Then since {tf = j} = Bj, 
n"? |P (S = mj flepai)| < nP (1a, As) < 8/4. 

Apply Lemma 7.7 to Gj s for each s with B:= B; := B(j, fjs n), ô := 
e/n'/?, and A(f — z; f) = B; = {tf = j} in this case. Then 

Pr [ [nP — POs logy > 2e} 

< Pr {nP \(P, — P) (A; Fjs)1B,)| > €}. 

Then summing over s, 


D = Pr* {|n P, — PCE — ay Pir-i) y > 2e) 
exp ( y? ) max, Pr {n'/?|(P, — PXA; fjs1g | > €} 


IA 


IA 


exp Vj e~* max, Var(Aj fj,s12,)- 


As $n \to \infty$, for fixed $j$ and $\varepsilon$, by (7.6) and (7.7), since for each $j$ and $s$,
$P(B_j) = P(B(j, f_{j,s}, n)) \to 0$ we have (I) converging to 0, so (I) is less than
$\varepsilon$ for $n$ large enough.

About (II): we have by (7.6) and (7.12) 


n'P ECA, fla, fenny) < nP EA, fy)? 


7.13 
< nP < €/4, e 


and 


E(A-fY Ua, pent2/0ry) < e/V y) 
= €/(2"*y,). 


Now by (7.8) for k = r, and since |f — x, f| < A, f, we have by (7.13), 
nrse(f — Tr f)lıf>r)| < £/4 for all f € F. Apply Lemma 7.7 for each 
s with A(f — x, f) := {tf > r} and B := {A; frs < n'/*/(2’y,)}, noting that 
A, f < A,- f for all f. Thus by (7.8) again, 


(7.14) 


(ID) := Př {|n (P, — Pf — m, Alepe le > 2e} 
< Pr{n'/||(P, — PXA, f + La, feme jæy lle > E} 


Then by Bernstein’s inequality (Theorem 1.11) and (7.8), 


e2 


I) <2. 3 

(D < 2- exp{y; 2e /(2'+2y,) + 2e/(3 - 2 y,) 
= 2-exp(y? — Z y,£/(1/6)) 
< 2. exp(—y, 27'e) 


} 


since 2y, < r < Bj; < £/24 by definition of 6, and (7.11). Then (ID 
< € by (7.12). 
About (III): fork = j + 1,...,7 — 1, by (7.7) and (7.6), 


nP EUA: Plij) < FH EUAS) < 2ye. (7.15) 


Then 


r-l 


MD) := Pn IP, — PD) Cf — mflr > 28) 
k=j+l 
r-1 


Yo PPIP, — PF — wef epawlle > 2 eyb; 
k=j+l 


IA 


since pS one 2 ey%41/B; < 2e. To apply Lemma 7.7 for each k = j + 
1,...,7 — l and foreach s, let ô := 2 ley 41 /B;, and A(f):= B := {tf = 
k} for f — xk f € Gk s. The hypothesis of the Lemma holds since 6; < ¢/24 
(7.11) implies pias) ae < 2 1€/B;, and by (7.8). So 


r—1 


D < $O PAn, — PAP elle > 27 eyb; 
k=j+1 


Then by Bernstein’s inequality again, (7.6) and (7.8), 


r-1 (x Sypy/Q" 63) 
k 


Il) < 2. ex 
i 2 P |E BIE Ihe /O- Fyb) 


rol 2,,2 
= Ð 2-exp (vè - "1 
kjai 4B + € + Yr+1/GYrb;)) 


Now since y is increasing with k, 


2.2 2 E  Vk+1 
-eva J (482+ 5 Fer) 
2 E 
=g? 462 | — ) 
FURE /( Pi Ee i ae 
s-er | (a824 35) 
J 


The latter expression, since B; < €/24 by (7.11) and so 2 < e/(128;), is 
bounded above by 


5 
672/43 (=) -5 = Seve (5B): 


It follows that 


r-l r-l 


D< $ 2-exp(-eyz/(2B;)) < J 2-exp(-12y/) 


k=j+1 k=j+1 
r-1 
=2- XO IRN, ++ Ny)? <2- Yok? < 4e 
k=j+1 k>j 
by (7.11). 

About (IV): if k= j+1,...,r and g € Ak(f) then m(g) = m(f), 
Tr-1(g) = mk-ı( f), and Asf = Asg for all s=1,...,k, so {tf > k} = 
{tg > k}. Thus the number of distinct functions (77; f — mx_-1 f)lzf>x 18 at most 
exp(y). Also, mf € Ax—1(f) and so by (7.6) E(am f — mr-1f)) LIR, 
Now 


r 


(IV) := P fn? WP, — P) | YO (ef malah >e 
k=j+1 F 
iy Pr{n!? (Pa — P) (ref — ma fylepse) |p > 2*en/B;| 


k=j+l 
from the definition of 6;. Bernstein’s inequality and (7.8) give 


r 622 -2k 2 -2 
(IV) < >» 2- exp (x VB; 


= 2 .9—-k R-15— 
fos 23-2 + 2¢9-kp-19-k 


Now since $; < ¢/24,8+ Zep7' < €/B; and 


(IV) <2 $` expa- €/B;)) <2 D> exp(—23y7) < 4e 
k=j+1 k=j+1 


as in (II). Thus the expression in (7.9) is less than $\alpha$. Letting $\alpha \downarrow 0$, $j \to \infty$,
and $n \to \infty$, the proof of Theorem 7.6 is complete.


Theorem 7.6 implies the following for $\mathcal{L}^1$ entropy with bracketing:


Corollary 7.8 Let $(X, \mathcal{A}, P)$ be a probability space and $\mathcal{F}$ a uniformly bounded
set of measurable functions on $X$. Suppose that

\[ \int_0^1 \big( \log N_{[\,]}^{(1)}(x^2, \mathcal{F}, P) \big)^{1/2} \, dx < \infty. \]

Then $\mathcal{F}$ is a Donsker class for $P$.


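Proof. Suppose $|f(x)| \le M < \infty$ for all $f \in \mathcal{F}$ and $x \in X$. Since multiplication
by a constant preserves the Donsker property (by Theorem 3.34), we can
assume $M = 1/2$. Then for any $f, g \in \mathcal{F}$ and $\varepsilon > 0$, $|f - g| \le 1$ everywhere.
So if $\int |f - g| \, dP \le \varepsilon^2$, then $(\int |f - g|^2 \, dP)^{1/2} \le \varepsilon$. So
$N_{[\,]}^{(2)}(\varepsilon, \mathcal{F}, P) \le N_{[\,]}^{(1)}(\varepsilon^2, \mathcal{F}, P)$, and the result follows from Theorem 7.6.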


It will be seen in the next section that Corollary 7.8, and thus Theorem 7.6,
are best possible (provide a characterization of the Donsker property) in some
cases.


7.3 The Power Set of a Countable Set: Borisov-Durst Theorem 


Let P be a law on the set N of nonnegative integers. The next theorem gives a
criterion for the Donsker property of the collection 2^N of all subsets of N, for
P, in terms of the numbers p_m := P({m}) for m ≥ 0. We also find that the
sufficient condition given in Corollary 7.8 is necessary for 2^N. Recall N_I as
defined above Theorem 7.1.


Theorem 7.9 (Borisov–Durst) The following are equivalent:


(a) 2^N is a Donsker class for P;
(b) Σ_m p_m^{1/2} < ∞;
(c) ∫_0^1 (log N_I(x², 2^N, P))^{1/2} dx < ∞.


Proof. We have (c) ⇒ (a) by Corollary 7.8. Next, to prove (a) ⇒ (b), suppose
Σ_m p_m^{1/2} = ∞. The random variables W_P({m}) := W_P(1_{{m}}) (for the isonormal
W_P on L²(P) as defined in Section 2.7) are independent and Gaussian with
mean 0 and variances p_m. We can write G_P(f) = W_P(f) − P(f)W_P(1) since
the right side is Gaussian and has mean 0 and the covariances of G_P. Then

$$\sum_m E|W_P(\{m\})| = \sum_m (2/\pi)^{1/2} p_m^{1/2}$$

diverges, while Σ_m Var(|W_P({m})|) ≤ Σ_m p_m < ∞. Thus for any M < ∞,
by Chebyshev's inequality,

$$\lim_{m\to\infty} P\Bigl(\sum_{j\le m} |W_P(\{j\})| > M\Bigr) = 1.$$

So Σ_j |W_P({j})| = +∞ almost surely. Now Σ_m p_m |W_P(1)| < ∞ a.s., so
Σ_m |G_P({m})| = +∞ a.s. Hence sup_{A⊂N} G_P(1_A) = +∞ a.s. and 2^N is not a
pregaussian class, so a fortiori not a Donsker class. Thus (a) ⇒ (b).
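
The constant (2/π)^{1/2} in the divergent series is the first absolute moment of a standard normal variable. As a brief sketch, write Z for a Gaussian variable with mean 0 and variance σ² = p_m (the symbols Z and σ are introduced here only for illustration):

```latex
E|Z| = \frac{2}{\sigma\sqrt{2\pi}} \int_0^\infty z\, e^{-z^2/(2\sigma^2)}\,dz
     = \frac{2}{\sigma\sqrt{2\pi}} \cdot \sigma^2
     = \sigma \left(\frac{2}{\pi}\right)^{1/2},
\qquad
\operatorname{Var}(|Z|) = E Z^2 - (E|Z|)^2
     = \sigma^2\left(1 - \frac{2}{\pi}\right) \le \sigma^2 .
```

With σ² = p_m this gives the terms (2/π)^{1/2} p_m^{1/2} of the series of means and the bound p_m on each variance.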
Next, to prove (b) ⇒ (c). Equivalently, let us prove

$$\sum_{k=1}^{\infty} 2^{-k}\bigl(\log N_I(4^{-k}, 2^{N}, P)\bigr)^{1/2} < \infty.$$

We can assume p_m ≥ p_r > 0 for m ≤ r. For j = 0, 1, 2, ..., let r_j be the
number of values of m such that 4^{−j−1} < p_m^{1/2} ≤ 4^{−j} and let C_j := r_j/4^j.
Then Σ_j C_j < ∞. For k ≥ k_0 large enough there is a unique j(k) such that

$$\sum_{j>j(k)} C_j/4^{j} \le 4^{-k} < \sum_{j\ge j(k)} C_j/4^{j}. \tag{7.16}$$

Let m(k) := m_k := Σ_{j=0}^{j(k)} r_j. Then

$$\sum_{m>m(k)} p_m \le \sum_{j>j(k)} r_j/4^{2j} \le 4^{-k}.$$

Let A_i run over all subsets of {1, ..., m(k)} where i = 1, ..., 2^{m(k)}. Let B_i :=
A_i ∪ {m ∈ N : m > m(k)}. Then for any C ⊂ N, C ∩ {1, ..., m(k)} = A_i for
some i. Then A_i ⊂ C ⊂ B_i and P(B_i \ A_i) ≤ 4^{−k}. So N_I(4^{−k}, 2^N, P) ≤
2^{m(k)}. Thus it will be enough to prove

$$\sum_k m_k^{1/2}/2^{k} < \infty, \tag{7.17}$$

with Σ_k restricted to k ≥ k_0. We have

$$\sum_k m_k^{1/2} 2^{-k} = \sum_k 2^{-k}\Bigl(\sum_{j=0}^{j(k)} C_j 4^{j}\Bigr)^{1/2}
\le \sum_k 2^{-k} \sum_{j=0}^{j(k)} C_j^{1/2} 2^{j}
= \sum_{j=0}^{\infty} C_j^{1/2}\, 2^{j} \sum_{k:\, j\le j(k)} 2^{-k}.$$

To prove this converges, since Σ_j C_j < ∞, it is enough by Cauchy's inequality
to prove Σ_j (2^j Σ_{k: j≤j(k)} 2^{−k})² < ∞. Let k(j) be the smallest k such that
j(k) ≥ j. Then

$$\sum_{k:\, j\le j(k)} 2^{-k} \le 2^{1-k(j)}.$$

To prove that Σ_j 4^{j−k(j)} < ∞, setting j(k_0 − 1) := 0 we have

$$\sum_{j\ge 1} 4^{j-k(j)} \le \sum_k \sum_{j:\, j(k-1)<j\le j(k)} 4^{j-k} \le 2\sum_k 4^{j(k)-k}.$$

For each k, let κ(k) be the smallest κ such that j(κ) = j(k). Then from (7.16)
for κ, letting K denote the range of κ(·), 4^{−κ} < Σ_{j≥j(κ)} C_j/4^j, so

$$\sum_k 4^{j(k)-k} \le \sum_k 4^{j(k)-k+\kappa(k)} \sum_{j\ge j(k)} C_j/4^{j}
= \sum_j C_j 4^{-j} \sum_{\kappa\in K,\; j(\kappa)\le j} 4^{j(\kappa)+\kappa} \sum_{k:\,\kappa(k)=\kappa} 4^{-k}
\le 2\sum_j C_j 4^{-j} \sum_{\kappa\in K,\; j(\kappa)\le j} 4^{j(\kappa)}.$$

Since j(·) is one-to-one on K, the sum is at most 4 Σ_j C_j < ∞.


Recall that if C = 2^N then sup_{A∈C} |(P_n − P)(A)| → 0 a.s. as n → ∞ for
any law P on N (Remark 6.25).


Problems 


1. Let (K, d) be a compact metric space. For 0 < α ≤ 1 let L_α(K, d) be the
set of all real-valued functions f on K such that |f(x) − f(y)| ≤ d(x, y)^α for
all x, y ∈ K. Show that for any law P on K, L_α(K, d) is a strong Glivenko–
Cantelli class for P.


2. Show that {f ∈ C[0, 1] : ||f||_sup ≤ 1} is not a weak Glivenko–Cantelli class
for P = U[0, 1].




3. Let (X, A, P) be a probability space, such that {x} ∈ A for all x ∈ X and P
has no soft atoms, as defined in Problem 9 of Chapter 5. Let C ⊂ A be such
that C is linearly ordered by inclusion and generates A. Give an upper bound
on N_I(ε, C, P) for 0 < ε < 1.


4. If X = N, A = 2^N and p_n := P({n}) := 1/2^{n+1} for n = 0, 1, 2, ...,
evaluate N_I(ε, A, P) for 0 < ε < 1.

5. If X = N, A = 2^N and for some α ∈ R, p_n := P({n}) :=
c_α/[(n + 1)²(log(n + 2))^α] for all n ∈ N, where c_α is such that Σ_n p_n = 1, for
what values of α is A a Donsker class for P?


6. Let (X, A, P) be a probability space. For j = 1, ..., k let F_j ⊂ L^0(X, A)
each satisfy the hypothesis of Theorem 7.6. Then show that the hypothesis also
holds for:


(a) F := ∪_{j=1}^k F_j;

(b) F := {Σ_{j=1}^k f_j : f_j ∈ F_j for each j}.

Let C_j be a collection of sets, F_j := {1_A : A ∈ C_j}. Show that the hypothesis
of Theorem 7.6 also holds for F := {1_A : A ∈ C} where


(c) C := {∪_{j=1}^k A_j : A_j ∈ C_j for each j}.


7. Let C := {[0, a] × [0, b] : 0 ≤ a ≤ 1, 0 ≤ b ≤ 1}. Let P be the uniform
distribution on the unit square [0, 1] × [0, 1]. Give an upper bound for
N_I(ε, C, P) for 0 < ε < 1.

8. In the unit cube X := [0, 1]^d in R^d, with uniform (Lebesgue) law P, for
some K < ∞ and c > 0, let C := C(K, d) be the collection of all subsets A of
X such that whenever X is decomposed as a union of n^d cubes of side 1/n,
each with vertices having coordinates i/n, i = 0, 1, ..., n, the boundary of A
intersects at most Kn^{d−c} of these cubes. Show that N_I(ε, C, P) < ∞ for all
ε > 0. Hints: (a) Given ε > 0, show that for n ≥ n_0(ε) large enough, any union
U of at most Kn^{d−c} of the n^d cubes at the nth stage has P(U) < ε.


(b) Let the sets C(i) and D(i) in the definition of N_I for the given ε each be
unions of cubes in the nth decomposition of [0, 1]^d. How many such unions are
there? (The number may be large, but you need to show that it is finite. Give
an explicit bound.)


(c) Give an upper bound for the number of pairs of such unions. (You need
not restrict to pairs such that P(D(i) \ C(i)) < ε, since again, you only need a
finite upper bound.)


9. Show that the hypothesis of the previous problem holds if C is the collection
of convex subsets of the unit square in R².


10. Prove the same for the unit cube in R^d for any finite dimension d.

11. If F is a class of indicators of measurable sets and ε > 0, show that
N_{[ ]}^{(2)}(ε, F, P) = N_{[ ]}^{(1)}(ε², F, P) (compare Corollary 7.8 and its proof).


12. Prove Corollary 7.8 directly by Bernstein’s inequality (Theorem 1.11). 


Notes 


Notes to Section 7.1. Theorem 7.1 is due to Blum (1955, Lemma 1) for families 
of (indicators of) sets and to DeHardt (1971, Lemma 1) for uniformly bounded 
families of functions. Mourier (1951, 1953 pp. 195-196) proved the law of 
large numbers in general separable Banach spaces, Corollary 7.4. I do not 
know a reference for Propositions 7.2 or 7.3. In the proof of 7.4 (Bochner or 
Pettis) integrals of Banach-valued functions (defined in Appendix E) were not 
assumed, so they had to be, in part, reconstructed. 


Notes to Section 7.2. Theorem 7.6 is due to M. Ossiander (1987). The shorter 
proof presented here is an expanded version of that of Arcones and Giné (1993, 
Theorem 4.10) and applies a technique from Andersen et al. (1988), who proved 
an extended bracketing central limit theorem. Corollary 7.8 was proved earlier, 
first for classes of sets in Dudley (1978), then for classes of functions in Dudley 
(1984). 


Notes to Section 7.3. In Theorem 7.9, it is not hard to prove that (b) ⇒ (a).
Durst and Dudley (1981) proved the equivalence of (a) and (b). I. S. Borisov
(1981) discovered and announced the more difficult implication (b) ⇒ (c). I
have not seen his proof.


8 Approximation of Functions and Sets


8.1 Introduction: The Hausdorff Metric 


In this chapter upper and lower bounds will be shown for the metric entropies
or capacities of various concrete classes of functions on Euclidean spaces and
sets in such spaces. Some metric entropies with bracketing are treated, and
some without. Metrics for functions are in L^p, 1 ≤ p ≤ ∞. For sets we use d_P
metrics d_P(B, C) := P(B Δ C) or the Hausdorff metric, defined as follows.
For any metric space (S, d), x ∈ S, and a nonempty B ⊂ S, let


d(x, B) := inf{d(x, y) : y ∈ B}.


Then d(x, B) = 0 if and only if x is in the closure of B. If A ⊂ B, then clearly
d(x, B) ≤ d(x, A). (In order to preserve this when (S, d) is unbounded and
A = Ø we need to set d(x, Ø) := +∞ and so we will do that in all cases.) For
nonempty bounded sets B, C ⊂ S the Hausdorff pseudo-metric is defined by


h(B, C) := max(sup_{x∈B} d(x, C), sup_{y∈C} d(y, B)).


To check that this is a pseudo-metric, clearly h(B, C) = h(C, B) ≥ 0. To
show that the triangle inequality h(B, D) ≤ h(B, C) + h(C, D) holds for
any nonempty, bounded sets B, C, and D, first, for any δ > 0 and x ∈ B
there is some y ∈ C with d(x, y) ≤ h(B, C) + δ. Then there is some z ∈ D
with d(y, z) ≤ h(C, D) + δ, so d(x, z) ≤ h(B, C) + h(C, D) + 2δ. Letting δ
decrease to 0 we get d(x, D) ≤ h(B, C) + h(C, D). A symmetric argument
starting with a point of D gives the triangle inequality. Thus h is a
pseudometric, as claimed.

Then h is a metric, called the Hausdorff metric, on the collection of bounded,
closed, nonempty subsets of S, since if B and C are two distinct such sets,
clearly h(B, C) > 0.
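
For finite sets of points the suprema and infima in the definition of h are maxima and minima, so h can be computed directly. The following is a minimal sketch in Python, assuming the Euclidean metric on tuples of coordinates; the function names `point_to_set` and `hausdorff` and the sample sets are illustrative choices, not notation from the text.

```python
import math

def point_to_set(x, B):
    """d(x, B): least Euclidean distance from x to a point of B (+inf if B is empty)."""
    return min((math.dist(x, y) for y in B), default=math.inf)

def hausdorff(B, C):
    """h(B, C) = max(sup_{x in B} d(x, C), sup_{y in C} d(y, B)) for finite nonempty sets."""
    if not B or not C:
        raise ValueError("h is only defined here for nonempty sets")
    return max(max(point_to_set(x, C) for x in B),
               max(point_to_set(y, B) for y in C))

B = [(0.0, 0.0), (1.0, 0.0)]
C = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
D = [(0.0, 1.0)]

print(hausdorff(B, C))   # every point of B lies in C; (1, 2) is at distance 2 from B
print(hausdorff(B, D) <= hausdorff(B, C) + hausdorff(C, D))   # triangle inequality
```

Note that B ⊂ C does not make h(B, C) small: the point (1, 2) of C is far from B, which is why both one-sided suprema appear in the definition.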


On R^d we have the usual Euclidean metric d(x, y) := |x − y| where
|u| := (u_1² + ··· + u_d²)^{1/2}, u ∈ R^d. If d ≥ 2, for any set H ⊂ R^{d−1} and function
f from H into [0, ∞] let


J_f := J(f) := {x ∈ R^d : 0 ≤ x_d ≤ f(x_(d)), x_(d) ∈ H}


where x_(d) := (x_1, ..., x_{d−1}). Then J(f) is the subgraph of f. For any two
bounded functions f ≥ 0 and g ≥ 0 on H, clearly h(J_f, J_g) ≤ d_sup(f, g). Thus
for any collection F of bounded real functions ≥ 0 on H, and any ε > 0,


D(ε, {J_f : f ∈ F}, h) ≤ D(ε, F, d_sup). (8.1)


From here on, assume that H is a bounded Borel set in R^{d−1} with nonempty
interior, whose Lebesgue measure therefore satisfies 0 < λ^{d−1}(H) < +∞. In
one case to be considered, H will be the (d − 1)-dimensional unit cube I^{d−1}
where I^d := {x ∈ R^d : 0 ≤ x_j ≤ 1, j = 1, ..., d}.

For two nonnegative functions f and g on H, the symmetric difference of
J_f and J_g is given by


J_f Δ J_g = {(u, v) : u ∈ H, f(u) < v ≤ g(u) or g(u) < v ≤ f(u)}.


If ε > 0 and d_sup(f, g) ≤ ε, let F := max(f − ε, 0). Then 0 ≤ F ≤ g ≤ f + ε,
so J_F ⊂ J_g ⊂ J_{f+ε}. If f and g are also measurable functions it follows from
this and the Tonelli–Fubini theorem that we have the following bounds for
Lebesgue d-dimensional measures:


$$\lambda^d(J_f \,\Delta\, J_g) \le \varepsilon\lambda^{d-1}(H), \qquad \lambda^d(J_{f+\varepsilon} \setminus J_F) \le 2\varepsilon\lambda^{d-1}(H).$$


Let F be a class of bounded nonnegative continuous functions on H, totally
bounded for d_sup. Given ε > 0, let m := D(ε, F, d_sup) and let f_1, ..., f_m in
F be such that d_sup(f_i, f_j) > ε for 1 ≤ i < j ≤ m. Then for each g ∈ F,
d_sup(g, f_j) ≤ ε for some j = 1, ..., m by definition of D(ε, ·, ·). Thus for
F_j := max(0, f_j − ε) for j = 1, ..., m we have F_j ≤ f_j for each j, f_j + ε −
F_j ≤ 2ε, and F ⊂ ∪_j [F_j, f_j + ε]. Moreover, d_sup(f_j, g) ≤ ε implies that
J_{F_j} ⊂ J_g ⊂ J_{f_j+ε}. From the last two facts it follows that we have the bracketing
covering number bounds, related to (7.3) in light of (7.2):


(i) if P is any law on H, then for any r with 1 ≤ r ≤ ∞,

$$N_{[\,]}^{(r)}(2\varepsilon, F, P) \le D(\varepsilon, F, d_{\sup}); \tag{8.2}$$


(ii) if Q is any law on H × [0, ∞) having a density q with respect to Lebesgue
measure on R^d with q(x) ≤ M < ∞ for all x, then by the Tonelli–Fubini
theorem again

$$N_I\bigl(2M\varepsilon\lambda^{d-1}(H), \{J_f : f \in F\}, Q\bigr) \le D(\varepsilon, F, d_{\sup}). \tag{8.3}$$

In the converse direction here is a lower bound for the Hausdorff distance
of subgraphs of Lipschitz functions. Recall that


$$\|f\|_L := \sup_{x \ne y} |f(x) - f(y)|/|x - y|.$$


Then we have:


Lemma 8.1 Let H be a bounded nonempty subset of R^{d−1}. Let 0 < K < ∞
and let f and g be functions from H into [0, +∞) such that ||f||_L ≤ K and
||g||_L ≤ K. Then h(J_f, J_g) ≥ d_sup(f, g)/√(1 + K²).


Proof. If t := d_sup(f, g) = 0, f = g and there is no problem, so assume that
t > 0. Take δ such that 0 < δ < t and take u ∈ H such that |(f − g)(u)| >
t − δ. By symmetry we can assume that f(u) > g(u). To find a lower
bound for d((u, f(u)), J_g), let G(w) := g(u) + Kd(w, u) for all w ∈ H.
Then since ||g||_L ≤ K we have g ≤ G and J_g ⊂ J_G. So d((u, f(u)), J_g) ≥
d((u, f(u)), J_G). Consider half-lines on which w = u + s(w_0 − u) for fixed
w_0 ≠ u and s ≥ 0. Then the graph of points (w, G(w)) on such a half-line is
itself a half-line L in R^d. Since (u, f(u)) is not in J_G, a line segment joining
(u, f(u)) to a point of J_G must pass through the boundary of J_G, which is
the graph of G. The line extending L forms an angle θ = tan^{−1} K with the
subspace {x ∈ R^d : x_d = 0}. The closest point p to (u, f(u)) in L is such that
(u, f(u)) − p is perpendicular to L, and so the three points (u, g(u)), (u, f(u)),
and p form a triangle whose angle at (u, f(u)) is also θ. It follows that


$$d((u, f(u)), J_G) \ge (f - g)(u)\cos\theta = (f - g)(u)/\sqrt{1 + K^2} \ge (d_{\sup}(f, g) - \delta)/\sqrt{1 + K^2}.$$


Letting δ ↓ 0, the conclusion follows.


Recall that a bounded number of Boolean operations preserve the Vapnik–
Červonenkis property (Theorem 4.8). The same holds for classes of sets
satisfying bounds on metric entropy (with inclusion). For any families C_j of subsets
of a set X, extending the notation in Section 4.5, let


$$\sqcap_{j=1}^{k}\, C_j := \Bigl\{\bigcap_{j=1}^{k} A_j : A_j \in C_j \text{ for all } j\Bigr\}, \qquad
\sqcup_{j=1}^{k}\, C_j := \Bigl\{\bigcup_{j=1}^{k} A_j : A_j \in C_j \text{ for all } j\Bigr\}.$$


Theorem 8.2 Let (X, A, P) be a probability space and C_j ⊂ A for j =
1, ..., k. Let f_{1j}(ε) := log N_I(ε, C_j, P), f_{2j}(ε) := log D(ε, C_j, d_P) for
j = 1, ..., k. Let C_0 := ⊓_{j=1}^k C_j. If for i = 1 or 2, there are a γ > 0 and
constants M_1, ..., M_k such that f_{ij}(ε) ≤ M_j ε^{−γ} for 0 < ε ≤ 1 and j = 1, ..., k,
then the same holds for j = 0. The statements also hold for i = 1 or 2 for
C_0 := ⊔_{j=1}^k C_j.

Proof. For N_I, given 0 < ε ≤ 1, for each j = 1, ..., k, take m_j brackets
[A_{jr}, B_{jr}] covering C_j with P(B_{jr} \ A_{jr}) ≤ ε/k for r = 1, ..., m_j and m_j ≤
exp(M_j(k/ε)^γ). If A_{jr(j)} ⊂ E_j ⊂ B_{jr(j)} for sets E_j ∈ C_j, j = 1, ..., k, and
some r(j), then


$$A_{(0)} := \bigcap_{j=1}^{k} A_{jr(j)} \subset E_0 := \bigcap_{j=1}^{k} E_j \subset B_{(0)} := \bigcap_{j=1}^{k} B_{jr(j)}$$


and P(B_{(0)} \ A_{(0)}) ≤ k(ε/k) = ε. The result for N_I then follows with M_0 :=
k^γ(M_1 + M_2 + ··· + M_k). Without inclusions, and/or for ∪ instead of ∩, the
proof is similar.

It follows that if each C_j for 1 ≤ j ≤ k satisfies an inclusion
(bracketing) condition sufficient for the Glivenko–Cantelli or Donsker property
respectively, then so does C_0 := ⊓_{j=1}^k C_j or ⊔_{j=1}^k C_j:

Corollary 8.3 Under the assumptions of Theorem 8.2 for i = 1,

(a) For any γ with 0 < γ < +∞, C_0 is a Glivenko–Cantelli class for P.

(b) If 0 < γ < 1, then C_0 is a Donsker class for P.


Proof. This follows from Theorem 8.2 and for (a), from the Blum–DeHardt
Theorem 7.1. For (b) it follows from Corollary 7.8.
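
As a worked check of the constants, here is a sketch of the two computations behind Theorem 8.2 (for N_I) and Corollary 8.3(b), assuming only the bracket counts m_j ≤ exp(M_j(k/ε)^γ) from the proof and the integral condition of Corollary 7.8:

```latex
\log N_I(\varepsilon, C_0, P) \le \sum_{j=1}^{k} \log m_j
  \le \sum_{j=1}^{k} M_j \,(k/\varepsilon)^{\gamma}
  = k^{\gamma}(M_1 + \cdots + M_k)\,\varepsilon^{-\gamma}
  = M_0\,\varepsilon^{-\gamma},
\qquad
\int_0^1 \bigl(M_0\, (x^2)^{-\gamma}\bigr)^{1/2}\,dx
  = M_0^{1/2} \int_0^1 x^{-\gamma}\,dx
  = \frac{M_0^{1/2}}{1-\gamma} < \infty
  \quad\text{for } 0 < \gamma < 1 .
```

For γ ≥ 1 the integral of x^{−γ} over (0, 1) diverges, which is why the Donsker conclusion in (b) is stated only for γ < 1.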


8.2 Spaces of Differentiable Functions and Sets with 
Differentiable Boundaries 


For any α > 0, spaces of functions will be defined having "bounded
derivatives through order α." If β is the largest integer < α, the functions will have
partial derivatives through order β bounded, and the derivatives of order β
will satisfy a uniform Hölder condition of order α − β. Still more specifically:
for x := (x_1, ..., x_d) ∈ R^d and p = (p_1, ..., p_d) ∈ N^d (where N is the set of
nonnegative integers) let [p] := p_1 + ··· + p_d and


$$D^p := \partial^{[p]}/\partial x_1^{p_1} \cdots \partial x_d^{p_d}.$$

For a real-valued function f on an open set U ⊂ R^d having all partial derivatives
D^p f of orders [p] ≤ β defined everywhere on U, let


$$\|f\|_{\alpha} := \|f\|_{\alpha,U} := \max_{[p]\le\beta} \sup\{|D^p f(x)| : x \in U\}
+ \max_{[p]=\beta} \sup_{x\ne y,\; x,y\in U} \frac{|D^p f(x) - D^p f(y)|}{|x-y|^{\alpha-\beta}}.$$

Here D^0 f := D^{(0,...,0)} f := f. Let I^d denote the unit cube {x ∈ R^d :
0 ≤ x_j ≤ 1, j = 1, ..., d}, and recall that x_(d) = (x_1, ..., x_{d−1}), x ∈ R^d. Let
F ⊂ R^d be a closed set which is the closure of its interior U. Let F_{α,K}(F)
denote the set of all continuous f : F → R such that when f is restricted to
U, ||f||_{α,U} ≤ K. For α = 1, F_{1,K}(F) is the set of bounded Lipschitz functions
f on F with ||f||_sup + ||f||_L ≤ K. A real-valued function g on any metric space
(S, e), e.g., R^d with usual metric, is said to be Hölder of order α if 0 < α ≤ 1
and |g(x) − g(y)| ≤ e(x, y)^α for all x, y ∈ S.

Recall the bounded Lipschitz norm ||f||_BL := ||f||_L + ||f||_sup. It will be
seen how the Lipschitz property is equivalent to having bounded derivatives
of order 1 (partial derivatives, in dimension d ≥ 2). In dimension d = 1, a
Lipschitz function f is absolutely continuous, and so by a classical theorem
of Lebesgue, it has a derivative f'(x) for almost all x. Clearly |f'(x)| ≤ ||f||_L
whenever f'(x) exists. On the other hand, if g is any bounded measurable
function with ||g||_∞ ≤ M and f(x) := ∫_0^x g(t) dt, then f is Lipschitz with
||f||_L ≤ M and f'(x) = g(x) for Lebesgue almost all x (e.g., RAP, Theorem
7.2.1). Here f' need not be continuous, e.g. if f(x) = |x|, at 0.

A real-valued function f on an open set U ⊂ R^d is said to be Fréchet
differentiable at x ∈ U if there is a vector v := (Df)(x) such that


f(y) = f(x) + v·(y − x) + o(|y − x|) as y → x.


Clearly, this implies that the partial derivatives ∂f(x)/∂x_j for j = 1, ..., d
exist and are the components of (Df)(x). From Lebesgue's 1-dimensional
theorem and the Tonelli–Fubini theorem, it is clear that if f is Lipschitz, then
these partial derivatives exist for λ^d-almost all x ∈ U. H. Rademacher proved
that moreover, f is Fréchet differentiable at λ^d-almost all x (see the Notes).

Next some related families of sets will be defined. Let G_{α,K,d} := F_{α,K}(I^d).
For d ≥ 2 let C(α, K, d) be the collection of all sets


J_f = J(f) = {x ∈ I^d : 0 ≤ x_d ≤ f(x_(d))}, f ∈ G_{α,K,d−1}, f ≥ 0.


If g and h are two functions defined for (small enough) y > 0, recall that
g ≍ h (as y ↓ 0) means that


$$0 < \liminf_{y\downarrow 0}\,(g/h)(y) \le \limsup_{y\downarrow 0}\,(g/h)(y) < +\infty. \tag{8.4}$$


Clearly, if f ∈ G_{α,K,d} and [p] < α, then D^p f ∈ G_{α−[p],K,d}. Let B_d := {x ∈
R^d : |x| < 1} and let B̄_d := {x ∈ R^d : |x| ≤ 1} be the open and closed unit
balls respectively in R^d. Here are some bounds on metric entropies, of which
Kolmogorov proved the main, first conclusion (a).


Theorem 8.4 (Kolmogorov) Let 0 < K < ∞, 0 < α < ∞ and d ≥ 1.
(a) As ε ↓ 0,

$$\log D(\varepsilon, G_{\alpha,K,d}, d_{\sup}) \asymp \varepsilon^{-d/\alpha}.$$

(b) For some T := T(α, K, d), any law P on I^d, 1 ≤ r ≤ ∞ and 0 < ε ≤ 1,

$$\log N_{[\,]}^{(r)}(\varepsilon, G_{\alpha,K,d}, P) \le T\varepsilon^{-d/\alpha}.$$

(c) For the Hausdorff metric we also have for such a T

$$\log D(\varepsilon, C(\alpha, K, d+1), h) \le T\varepsilon^{-d/\alpha}.$$

(d) If Q is a law on I^{d+1} having a density with respect to Lebesgue measure
bounded by M, then for some M_1 = M_1(M, d, K, α),

$$\log N_I(\varepsilon, C(\alpha, K, d+1), Q) \le M_1\varepsilon^{-d/\alpha}, \qquad 0 < \varepsilon \le 1.$$

(e) Parts (a) and (b) hold for B̄_d in place of I^d and so F_{α,K}(B̄_d) in place of
G_{α,K,d}, with a possibly larger constant T.


Corollary 8.5 Let 0 < K < ∞. For any dimension d:


(a) G_{α,K,d} is a Glivenko–Cantelli class for any α > 0 and law P on I^d.
(b) G_{α,K,d} is a Donsker class for any α > d/2 and law P on I^d.
(c) If α > d, then C(α, K, d + 1) is a Donsker class for any law Q on I^{d+1}
with a bounded density.


Proof. Parts (a) and (b) follow from Theorem 8.4(b) for r = 1 by the Blum–
DeHardt Theorem 7.1, and for r = 2 and Ossiander's Theorem 7.6, respectively.
Part (c) follows from Theorem 8.4(d) and Corollary 7.8.
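
The thresholds α > d/2 and α > d can be read off from the entropy integrals. The following is a sketch, assuming (as in the proof of Corollary 7.8) that the hypothesis of Theorem 7.6 is finiteness of the integral over (0, 1) of the square root of the logarithm of the L² bracketing number, and using the bounds of Theorem 8.4(b) with r = 2 and Theorem 8.4(d):

```latex
\int_0^1 \bigl(T x^{-d/\alpha}\bigr)^{1/2}\,dx
  = T^{1/2} \int_0^1 x^{-d/(2\alpha)}\,dx < \infty
  \iff \frac{d}{2\alpha} < 1 \iff \alpha > \frac{d}{2},
\qquad
\int_0^1 \bigl(M_1 (x^2)^{-d/\alpha}\bigr)^{1/2}\,dx
  = M_1^{1/2} \int_0^1 x^{-d/\alpha}\,dx < \infty
  \iff \alpha > d .
```

The second integral uses x² in place of x because Corollary 7.8 is stated for L¹ bracketing numbers evaluated at x².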


Side Remark. The need for differentiability of order larger than d/2 for some
conclusion, as for the Donsker property in Corollary 8.5(b), also occurs in the
very important Sobolev embedding, relating to Schwartz(-Sobolev) distribution
theory and partial differential equations.

For 1 ≤ q < ∞ let L^q_loc(R^d) be the space of measurable real-valued functions
f such that ∫_U |f(x)|^q dx < ∞ for every bounded open set U. If
f ∈ L^1_loc(R^d), then a Schwartz distribution (generalized function) [f] is defined
by [f](φ) = ∫ f(x)φ(x) dx for every φ ∈ D, the space of C^∞ functions
with compact support. For a multi-index p = (p_1, ..., p_d), p_j ∈ N, one writes
D^p[f] = [g] if also g ∈ L^1_loc and (−1)^{[p]} [f](D^p φ) = [g](φ) for all φ ∈ D. (If
f is C^∞, then one can take g = D^p f in the classical sense via integration
by parts [p] times.) One form of Sobolev embedding (e.g., Hörmander, 1983,
Theorem 4.5.13(ii)) states:


Theorem. Let 1 < q < ∞, let m > 0 be an integer, and let u ∈ L^1_loc(R^d).
Suppose for each p with [p] = m, D^p[u] = [u_p] for some u_p ∈ L^q_loc(R^d).
If mq > d, then [u] = [v] (u = v almost everywhere for Lebesgue measure)
where v is continuous and moreover Hölder of order γ if 0 < γ < 1 and
γ ≤ m − (d/q), meaning that for any compact K ⊂ R^d,
sup_{x,y∈K, x≠y} |v(x) − v(y)|/|x − y|^γ < ∞.

Often q = 2 is taken, as Hilbert space properties of L² are convenient. In
that case the condition on m is m > d/2, as in Corollary 8.5(b). An example
shows sharpness: let d = 2, m = 1, q = 2, and for r = √(x² + y²) on R², let
f(x, y) = log(log(1/r)) for r < 1/e, defined otherwise elsewhere to make it a
C^∞ function except at (0, 0), approaching which it is (very slowly) unbounded
and discontinuous. The first partial derivatives of f are in L^q_loc for q = 2 but
not for any q > 2 (which would contradict the embedding theorem).
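
The integrability claim in the example can be checked in polar coordinates. As a sketch, for a radius r_0 < 1/e (a symbol introduced here only for the computation) and f(x, y) = log(log(1/r)) near the origin:

```latex
|\nabla f(x,y)| = \frac{1}{r\log(1/r)}, \qquad
\int_{r<r_0} |\nabla f|^{q}\,dx\,dy
  = 2\pi \int_0^{r_0} \frac{r^{1-q}}{(\log(1/r))^{q}}\,dr .
% q = 2: substitute u = \log(1/r), so du = -dr/r
2\pi \int_0^{r_0} \frac{dr}{r\,(\log(1/r))^{2}}
  = 2\pi \int_{\log(1/r_0)}^{\infty} \frac{du}{u^{2}}
  = \frac{2\pi}{\log(1/r_0)} < \infty .
% q > 2: the same substitution gives r^{1-q}\,dr = -e^{(q-2)u}\,du
2\pi \int_{\log(1/r_0)}^{\infty} e^{(q-2)u}\,u^{-q}\,du = \infty .
```

So the gradient is square integrable near the origin, while for q > 2 the exponential factor makes the integral diverge.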


Note. For α ≥ 1, in Theorem 8.4(c) about h, the order ε^{−d/α} is precise; see
Corollary 8.10.


Proof of Theorem 8.4. If part (a) holds, then part (b) follows using (7.2) and
(7.3). Then parts (c) and (d) follow by (8.1) and (8.3). To begin the proof of
part (a), for each f ∈ G_{α,K,d}, x ∈ I^d and x + h ∈ I^d write the Taylor series
with remainder


$$f(x+h) - \sum_{k=0}^{\beta} Q_k(x,h) = R(x,h) \tag{8.5}$$


where for each x, Q_k(x, ·) is a homogeneous polynomial of degree k in h and
by the mean value theorem


$$|R(x,h)| \le C|h|^{\alpha} \tag{8.6}$$


for some constant C = C(d, K, α). Then C ≥ 1 will be taken large enough so
that whenever [p] ≤ β we also have


$$|R_p(x,h)| := \Bigl| D^p f(x+h) - \sum_{k=0}^{\beta-[p]} Q_{k,p}(x,h) \Bigr| \le C|h|^{\alpha-[p]} \tag{8.7}$$


where for each k, p and x, Q_{k,p}(x, ·) is also a homogeneous polynomial of
degree k.

To prove one half of the first conclusion in Theorem 8.4 it needs to be shown
that


$$\limsup_{\varepsilon\downarrow 0}\, \bigl[\log D(\varepsilon, G_{\alpha,K,d}, d_{\sup})\bigr]\,\varepsilon^{d/\alpha} < \infty. \tag{8.8}$$


Given 0 < ε ≤ 1, let Δ := (ε/(4C))^{1/α} < 1. Let x_(1), ..., x_(s) be a Δ/2-
net in I^d, i.e., sup{inf_{j≤s} |x − x_(j)| : x ∈ I^d} ≤ Δ/2. Here we can take s ≤
M_2 ε^{−d/α} for some constant M_2 = M_2(d, C); specifically, one can choose the
x_(j) as centers of cubes of a decomposition of I^d into cubes of side 1/m where
m is the least integer ≥ d^{1/2}/Δ. For each multi-index p with [p] = k ≤ β
let p! := Π_{j=1}^d p_j! and for h ∈ R^d, h^p := Π_{j=1}^d h_j^{p_j}. Then for f ∈ G_{α,K,d} let
Q^p(x, h) := (D^p f)(x) h^p/p!. Thus in (8.5)

$$Q_k(x,h) = \sum_{[p]=k} Q^p(x,h). \tag{8.9}$$

Let ε_k := ε/(2Δ^k e^d), k = 0, 1, ..., β,

A_{i,p} := A_{i,p}(f) := [D^p f(x_(i))/ε_{[p]}], i = 1, ..., s, [p] = k ≤ β,

where [x] denotes the largest integer ≤ x, −∞ < x < ∞.
Given some A := {A_{i,p} : i ≤ s, [p] ≤ β} let


G_{α,K,d}(A) := {f ∈ G_{α,K,d} : A_{i,p}(f) = A_{i,p} for all i ≤ s, [p] ≤ β}.


Lemma 8.6 If f, g ∈ G_{α,K,d}(A) for some A, then
sup{|(f − g)(x)| : x ∈ I^d} ≤ ε.

Proof. Let F := f − g. Whenever [p] ≤ β and i ≤ s, we will have
|(D^p F)(x_(i))| ≤ ε_{[p]}. Also, by (8.7),
|D^p F(x + h) − D^p F(x)| ≤ 2C|h|^{α−β} if [p] = β.

For each y ∈ I^d take an x_(i) with |y − x_(i)| ≤ Δ/2. Then from (8.5), (8.6), and
(8.9) with h := y − x_(i),


$$|F(y)| \le 2C|h|^{\alpha} + \sum_{k=0}^{\beta} \varepsilon_k \sum_{[p]=k} |h^p|/p!
\le 2C\Delta^{\alpha} + \sum_{k=0}^{\beta} \varepsilon_k \Delta^{k} \sum_{[p]=k} 1/p!
\le e^{d} \max_{k\le\beta} \varepsilon_k \Delta^{k} + \varepsilon/2 \le \varepsilon,$$


proving the Lemma.


Now continuing the proof of Theorem 8.4, it follows that

D(ε, G_{α,K,d}, d_sup) ≤ N_{α,K,d}

where N_{α,K,d} is the number of distinct nonempty sets G_{α,K,d}(A). Let the x_(j) be
ordered so that for 1 < j ≤ s, |x_(j) − x_(i)| ≤ Δ for some i < j. Such an
ordering clearly exists for d = 1. Then by induction on d, we enumerate subcubes
beginning and ending with subcubes at vertices of I^d, where we first enumerate
cubes on one face I^{d−1}, then on the adjoining level, etc.

Now suppose f ∈ G_{α,K,d}(A) (so that this set is nonempty) and suppose given
the values A_{i,p} for i < j for some j ≤ s. Choose i < j such that |x_(i) − x_(j)| ≤
Δ. Take the Taylor expansion as in (8.5) with x = x_(i), h = x_(j) − x_(i). By (8.7)
and (8.9),


$$\Bigl| D^p f(x_{(j)}) - \sum_{k=0}^{\beta-[p]} \sum_{[q]=k} D^{p+q} f(x_{(i)})\, h^{q}/q! \Bigr| \le C|h|^{\alpha-[p]}.$$


Now |D^{p+q} f(x_(i)) − A_{i,p+q} ε_{k+[p]}| ≤ ε_{[p+q]} for [q] = k = 0, 1, ..., β − [p].
Let

$$D(j, f, p) := \Bigl| D^p f(x_{(j)}) - \sum_{k=0}^{\beta-[p]} \varepsilon_{k+[p]} \sum_{[q]=k} A_{i,p+q}\, h^{q}/q! \Bigr| \Big/ \varepsilon_{[p]}.$$

Then by the latter two inequalities,

$$|D(j, f, p)| \le \Bigl( C\Delta^{\alpha-[p]} + \sum_{k=0}^{\beta-[p]} \varepsilon_{k+[p]} \Delta^{k} \sum_{[q]=k} 1/q! \Bigr) \Big/ \varepsilon_{[p]}
\le 2Ce^{d}/(4C) + e^{d} \le 3e^{d}/2.$$


So for f ∈ G_{α,K,d} and given the A_{i,p}(f) for i < j there are at most 3e^d + 2 ≤
4e^d possible values of A_{j,p} for a given p. The number of different p ∈ N^d
with [p] ≤ β is bounded above by (β + 1)^d. Thus the number of possible sets
of values {A_{j,p}}_{[p]≤β} for the given j ≥ 2 is at most exp((d + log 4)(β + 1)^d).
The number of possible values of the vectors {A_{1,p}}_{[p]≤β} is at most

$$\bigl(2e^{d}(2K+1)/\varepsilon\bigr)^{(\beta+1)^{d}}.$$

Thus by Lemma 8.6

$$\begin{aligned}
N_{\alpha,K,d} &\le \bigl(2e^{d}(2K+1)/\varepsilon\bigr)^{(\beta+1)^{d}} (4e^{d})^{(\beta+1)^{d} s} \\
&\le \exp\bigl((\beta+1)^{d}\{(d+\log 4)M_2\varepsilon^{-d/\alpha} + \log(e^{d}(4K+2)/\varepsilon)\}\bigr) \\
&\le \exp(J\varepsilon^{-d/\alpha})
\end{aligned}$$

for some J = J(d, α, K) not depending on ε, proving (8.8).
In the other direction we need to prove


$$\liminf_{\varepsilon\downarrow 0}\, \log D(\varepsilon, G_{\alpha,K,d}, d_{\sup})\,\varepsilon^{d/\alpha} > 0. \tag{8.10}$$


Let f be a C^∞ function on R^d, 0 outside I^d and positive on its interior, such
as


$$f(x) := \prod_{j=1}^{d} g(x_j), \quad\text{where}\quad
g(t) := \begin{cases} \exp(-1/t)\exp(-1/(1-t)), & 0 < t < 1, \\ 0 & \text{elsewhere.} \end{cases} \tag{8.11}$$

For m = 1, 2, ..., decompose I^d into m^d subcubes A_{mi} of side 1/m, i =
1, ..., m^d. Let x_(i) be the vertex of A_{mi} closest to the origin, in other words,
the vertex of A_{mi} at which all coordinates are smallest. Given α > 0 set
f_i(x) := m^{−α} f(m(x − x_(i))). Note that x ↦ m(x − x_(i)) is a 1-1 affine map
of R^d onto itself, taking the interior of A_{mi} onto that of I^d. Here an affine
transformation is a function A from R^d onto itself of the form A(x) = Bx + v
where B is a linear transformation, defined by a d × d matrix which in this case
will need to be nonsingular, and v is any fixed element of R^d. Thus f_i(x) > 0
if and only if x is in the interior of A_{mi}. Let s := sup_x f(x) (= e^{−4d} for our f).
For any S ⊂ {1, ..., m^d} let f_S := Σ_{i∈S} f_i. Then ||f_S||_α ≤ ||f||_α =: B while
for S ≠ T, sup_x |(f_S − f_T)(x)| = m^{−α}s. Thus for any ε_m := Km^{−α}s/(3B) we
have D(ε_m, G_{α,K,d}, d_sup) ≥ 2^{m^d} ≥ exp(Cε_m^{−d/α}) for some C = C(K, d, α, s)
not depending on m. Since ε_{m+1}/ε_m → 1 as m → ∞ this is enough to prove
(8.10), and so finish the proof of Theorem 8.4(a).
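
The construction can be tried numerically. The following is a minimal Python sketch, assuming small illustrative values d = 2, m = 3, α = 1.5 and indexing the subcubes by integer tuples rather than by i = 1, ..., m^d; the helper names `g`, `f`, `f_S` are hypothetical. It checks that the supremum of f is e^{−4d} and that two sums f_S, f_T with S ≠ T differ by m^{−α}s at the center of a subcube in the symmetric difference.

```python
import math
from itertools import product

def g(t):
    """One-dimensional bump from (8.11): positive on (0, 1), zero elsewhere."""
    if 0.0 < t < 1.0:
        return math.exp(-1.0 / t) * math.exp(-1.0 / (1.0 - t))
    return 0.0

def f(x):
    """Product bump on the unit cube: f(x) = g(x_1) * ... * g(x_d)."""
    return math.prod(g(t) for t in x)

def f_S(x, S, m, alpha):
    """Sum over idx in S of m^(-alpha) * f(m * (x - corner)), corner = idx / m."""
    total = 0.0
    for idx in S:
        total += f(tuple(m * xj - ij for xj, ij in zip(x, idx)))
    return m ** (-alpha) * total

d, m, alpha = 2, 3, 1.5
s = f((0.5,) * d)                      # g peaks at t = 1/2, so f peaks at the center

S = {(0, 0), (1, 2)}                   # two index sets sharing the subcube (0, 0)
T = {(0, 0), (2, 1)}
centers = [tuple((i + 0.5) / m for i in idx)
           for idx in product(range(m), repeat=d)]
gap = max(abs(f_S(x, S, m, alpha) - f_S(x, T, m, alpha)) for x in centers)

print(s, math.exp(-4 * d))
print(gap, m ** (-alpha) * s)
```

Only subcube centers are sampled, which is enough here because each rescaled bump attains its maximum at the center of its own subcube and the bumps have disjoint supports.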

Now it will be shown how to adapt the proof to part (e) for the ball B̄_d in
place of I^d. In S^{d−1} := {x ∈ R^d : |x| = 1}, for 0 < ε ≤ 1, take D(ε, S^{d−1}, e)
points at distances > ε apart where e is the Euclidean metric on R^d. The balls
of radius ε/2 with centers at these points are disjoint. Thus by volumes, then
the mean value theorem,


$$D(\varepsilon, S^{d-1}, e)\,(\varepsilon/2)^{d} \le (1+\varepsilon/2)^{d} - (1-\varepsilon/2)^{d}
\le d\varepsilon\,(1+\varepsilon/2)^{d-1} \le d\varepsilon\,(3/2)^{d-1},$$


and so D(ε, S^{d−1}, e) ≤ 2d(3/ε)^{d−1}. Take a maximal set S in S^{d−1} of points
at distances > ε/2 apart, with |S| ≤ 2d(6/ε)^{d−1}. Let W consist of 0 and all
points tx for x ∈ S and t = jε/2, j = 1, 2, ..., ⌊2/ε⌋. Then W is ε-dense in
B̄_d and


|W| ≤ 1 + 2d(2/ε)(6/ε)^{d−1} ≤ d(6/ε)^d.


Then, starting at 0 and moving outward along each segment tx, x ∈ S, 0 ≤
t ≤ 1, through points in W, one can do the same proof as for (8.8) in I^d,
except for larger constants J, M_1. For a lower bound of the form (8.10), note
that B̄_d includes a cube of side d^{−1/2} centered at 0. This finishes the proof of
Theorem 8.4.


Next, some lower bounds for covering or packing numbers will be given.
For a collection F ⊂ L¹(X, A, P), we have the L¹ distance d_{1,P}(f, g) :=
P(|f − g|).


Theorem 8.7 Let P be a law on I^d having a density with respect to Lebesgue
measure bounded below by γ > 0. Then for some C = C(γ, α, K, d) > 0, for
ε > 0 small enough, and 1 ≤ r ≤ ∞,


$$N_{[\,]}^{(r)}(\varepsilon, G_{\alpha,K,d}, P) \ge N_I(\varepsilon, G_{\alpha,K,d}, P) \ge D(\varepsilon, G_{\alpha,K,d}, d_{1,P}) \ge \exp(C\varepsilon^{-d/\alpha}).$$


If d ≥ 2, for small enough ε > 0, and M := C(γ, α, K, d − 1),


$$N_I(\varepsilon, C(\alpha, K, d), P) \ge D(\varepsilon, C(\alpha, K, d), d_P) \ge \exp(M\varepsilon^{-(d-1)/\alpha}).$$
Proof. The following combinatorial fact will be used:


Lemma 8.8 Let B be a set with n elements, n = 0, 1, .... Then there exist
subsets E_i ⊂ B, i = 1, ..., k, where k ≥ e^{n/6}, such that for i ≠ j, the symmetric
difference E_i Δ E_j has at least n/5 elements.


Proof. For any set E ⊂ B, the number of sets F ⊂ B such that card(E Δ F) ≤
n/5 is 2^n B(n/5, n, 1/2), where binomial probabilities B(k, n, p) are as defined
before the Chernoff inequality (Theorem 1.15). If S_n is the sum of n independent
Rademacher variables X_i taking values ±1 with probability 1/2 each, then
by one of Hoeffding's inequalities (Proposition 1.12), defining "success" as
X_i = −1,


$$B(n/5, n, 1/2) = \Pr(S_n \ge 3n/5) \le \exp(-9n/50) \le e^{-n/6}.$$

Let d(E, F) := card(E Δ F). Thus for any E, card({F : d(E, F) ≤ n/5}) ≤
2^n e^{−n/6}. Recursively, choose a set E_1, say Ø. Given E_1, ..., E_m such that
d(E_i, E_j) > n/5 for i ≠ j, i, j ≤ m, we have

$$\operatorname{card}\Bigl(\bigcup_{j=1}^{m} \{F : d(E_j, F) \le n/5\}\Bigr) \le m\,2^{n} e^{-n/6} < 2^{n}$$

if m < e^{n/6}. Then we can choose E_{m+1} such that d(E_{m+1}, E_j) > n/5 for all
j = 1, ..., m and continue until m = k ≥ e^{n/6} as stated.
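
The recursive choice in this proof is a greedy procedure and can be run directly for small n. The following is a minimal Python sketch, assuming subsets of {0, ..., n−1} are encoded as bitmasks and scanned in increasing order (so the empty set is chosen first); the function name `greedy_separated_subsets` is illustrative.

```python
import math

def greedy_separated_subsets(n):
    """Greedily pick subsets of an n-element set, encoded as bitmasks, so that
    every pair has symmetric difference of more than n/5 elements."""
    chosen = []
    for mask in range(1 << n):           # mask 0 encodes the empty set
        # mask ^ e encodes the symmetric difference; count its elements.
        if all(bin(mask ^ e).count("1") > n / 5 for e in chosen):
            chosen.append(mask)
    return chosen

n = 10
subsets = greedy_separated_subsets(n)
print(len(subsets), ">=", math.exp(n / 6))
```

Because the scan runs over all 2^n subsets, the resulting family is maximal, and the counting argument in the proof then forces it to have at least e^{n/6} members; in practice the greedy family is typically much larger than this bound.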


Now to prove Theorem 8.7, the first inequality follows from (7.2), and
the second is also straightforward. For the lower bound on D, let us use
again the construction in the proof of (8.10). Let λ denote Lebesgue measure
on I^d and δ := γ ∫ f dλ for the f in (8.11). Then for each i, ∫ f_i dP ≥
δm^{−α−d}. Applying Lemma 8.8 and obtaining sets S with card(S) ≥ m^d/5 gives
∫ f_S dP > δm^{−α}/6, and ∫ |f_S − f_T| dP = ∫ f_{SΔT} dP. So


$$D(\delta m^{-\alpha}/6, G_{\alpha,K,d}, d_{1,P}) \ge \exp(m^{d}/6).$$

Thus if 0 < ε ≤ δ/6, since [x] ≥ x/2 for x ≥ 1,


$$D(\varepsilon, G_{\alpha,K,d}, d_{1,P}) \ge \exp\bigl([(\delta/(6\varepsilon))^{1/\alpha}]^{d}/6\bigr)
\ge \exp\bigl(2^{-d}(\delta/(6\varepsilon))^{d/\alpha}/6\bigr),$$

proving the statement about G_{α,K,d}.
If C = J(f) and D = J(g), then


d_P(C, D) := P(C Δ D) ≥ γλ(C Δ D) = γ ∫ |f − g| dλ, so for ε > 0


$$N_I(\varepsilon, C(\alpha, K, d), P) \ge D(\varepsilon, C(\alpha, K, d), d_P) \ge D(\varepsilon/\gamma, G_{\alpha,K,d-1}, d_{1,\lambda}),$$


which finishes the proof of Theorem 8.7.
To get lower bounds for the Hausdorff metric, the following will help:

Lemma 8.9 If α ≥ 1 and f, g ∈ G_{α,K,d}, then
h(J_f, J_g) ≥ d_sup(f, g)/(2 max(1, Kd)).


Proof. Note that for α ≥ 1, any g ∈ G_{α,K,d} is Lipschitz in each coordinate
by the mean value theorem with |g(x) − g(y)| ≤ K|x − y| if x_j = y_j for all
but one value of j. In the cube, one can go from a general x to a general y
by changing one coordinate at a time, so g is Lipschitz with ||g||_L ≤ Kd and
Lemma 8.1 applies.


Corollary 8.10 If α ≥ 1 and d = 1, 2, ..., then as ε ↓ 0,
log D(ε, C(α, K, d + 1), h) ≍ ε^{−d/α}.


Proof. This follows from Lemma 8.9 and Theorem 8.4.


Remark 8.11 For m = 1, 2, ..., let I^d be decomposed into a grid of m^d
subcubes of side 1/m. Let E be the set of centers of the cubes. For any A ⊂ I^d
let B ⊂ E be the set of centers of the cubes in the grid that A intersects.
Then h(A, B) ≤ d^{1/2}/(2m), which includes the possibility that A = B = Ø.
For 0 < ε ≤ 1 there is a least m = 1, 2, ... such that d^{1/2}/m ≤ ε, namely,
m = ⌈d^{1/2}/ε⌉. It follows that


$$D(\varepsilon, 2^{I^{d}}, h) \le 2^{(1 + d^{1/2}/\varepsilon)^{d}}.$$


Hence for α < d/(d + 1), Corollary 8.10 cannot hold, nor can the upper bound
for h in Theorem 8.4 be sharp.


The classes $\mathcal{C}(\alpha, K, d)$ considered so far contain sets with flat faces except
for one curved face. There are at least two ways to form more general classes
of sets with piecewise differentiable boundaries, still satisfying the bounds in
Theorem 8.4. One is to take a bounded number of Boolean operations. Let
$v_1, \dots, v_k$ be nonzero vectors in $\mathbb{R}^d$ where $d \ge 2$. For constants $c_1, \dots, c_k$ let
$H_j := \{x \in \mathbb{R}^d : (x, v_j) = c_j\}$, a hyperplane. Let $\pi_j$ map each $x \in \mathbb{R}^d$ to its
nearest point in $H_j$, $\pi_j(x) := x - ((x, v_j) - c_j)v_j/|v_j|^2$. Let $T_j$ be a cube in
$H_j$ and $\alpha, K > 0$. Let $f_j$ be an affine transformation taking $T_j$ onto $I^{d-1}$. For
$g \in \mathcal{G}_{\alpha,K,d-1}$ with $g \ge 0$, let

$$J_j(g) := \{x \in \mathbb{R}^d : \pi_j(x) \in T_j,\ c_j \le (v_j, x) \le c_j + g(f_j(\pi_j(x)))\}.$$

Let

$$\mathcal{C}_j := \mathcal{C}(\alpha, K, v_j, c_j) := \{J_j(g) : g \in \mathcal{G}_{\alpha,K,d-1}\}.$$
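As a small illustration of the nearest-point map $\pi_j$, here is a sketch under the assumption that the projection divides by $|v|^2$, which is what places the image in the hyperplane $\{y : (y, v) = c\}$. The function name `project_to_hyperplane` is hypothetical.

```python
def project_to_hyperplane(x, v, c):
    """Nearest point to x in the hyperplane {y : (y, v) = c}, for a nonzero vector v."""
    dot_xv = sum(xi * vi for xi, vi in zip(x, v))
    norm_sq = sum(vi * vi for vi in v)
    t = (dot_xv - c) / norm_sq          # signed multiple of v to subtract
    return tuple(xi - t * vi for xi, vi in zip(x, v))

x = (1.0, 2.0, 3.0)
v = (0.0, 0.0, 2.0)
c = 4.0
p = project_to_hyperplane(x, v, c)
print(p)  # (1.0, 2.0, 2.0)
```

Here $(p, v) = 4 = c$, and $x - p$ is a multiple of $v$, so $p$ is indeed the foot of the perpendicular from $x$.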


Then Theorem 8.4 implies that if $C$ is a cube in $\mathbb{R}^d$ including all sets in $\mathcal{C}_j$, and
$P$ is a law on $C$ having bounded density with respect to Lebesgue measure $\lambda$,
then for some $M_j < \infty$,

$$\log D(\varepsilon, \mathcal{C}_j, d_P) \le \log N_I(\varepsilon, \mathcal{C}_j, P) \le M_j \varepsilon^{-(d-1)/\alpha}.$$

We then have by Theorem 8.2 the following:


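Theorem 8.12 Let $d \ge 2$ and let $\mathcal{C}_0 := \sqcap_{j=1}^k \mathcal{C}_j$ or $\mathcal{C}_0 := \sqcup_{j=1}^k \mathcal{C}_j$, for $\mathcal{C}_j$ as
just defined. Then for some $M < \infty$,

$$\log D(\varepsilon, \mathcal{C}_0, d_P) \le \log N_I(\varepsilon, \mathcal{C}_0, P) \le M \varepsilon^{-(d-1)/\alpha}.$$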


By intersections or unions of $k$ sets in classes $\mathcal{C}_j$ (with $k$ depending on $d$), one
can obtain sets with smooth boundaries (through order $\alpha$) such as ellipsoids;
see Problem 5. Unions work more easily. One can also get more general sets,
since, e.g., for $\alpha \ge 1$, the minimum or maximum of two functions in $\mathcal{G}_{\alpha,K,d}$
need not have first derivatives everywhere and then will not be in $\mathcal{G}_{\gamma,\kappa,d}$ for any
$\gamma > 1$ and $\kappa < \infty$.

Recall that a $C^\infty$ real-valued function on an open set in $\mathbb{R}^d$ is one such
that the partial derivatives $D^p f$ exist for all $p \in \mathbb{N}^d$ and are continuous. For
functions $f := (f_1, \dots, f_k)$ into $\mathbb{R}^k$, for $f$ to be $C^\infty$ means that each $f_j$
is. Another way to generate sets with boundaries differentiable of order $\alpha$ is
as follows. The unit sphere $S^{d-1} := \{x \in \mathbb{R}^d : |x| = 1\}$ is a $C^\infty$ manifold,
specifically as follows. $S^{d-1}$ is the union of two sets $A := \{x \in S^{d-1} : x_1 \ge -1/2\}$
and $C := \{x \in S^{d-1} : x_1 \le 1/2\}$. There is a 1-1, $C^\infty$ function $\psi$ from
$\{x \in \mathbb{R}^{d-1} : |x| < 9/8\}$ into $\mathbb{R}^d$, with derivative matrix $\{\partial\psi_i/\partial x_j\}_{i=1,\,j=1}^{d,\ d-1}$ of
maximum rank $d - 1$ everywhere, such that $\psi$ takes $B_{d-1} := \{x \in \mathbb{R}^{d-1} : |x| \le 1\}$
onto $A$. Let $\eta(y) := (-\psi_1(y), \psi_2(y), \dots, \psi_d(y))$. Then the above
statements for $\psi$ and $A$ also hold for $\eta$ and $C$.

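For $0 < \alpha, K < \infty$ let $\mathcal{F}_{\alpha,K}(S^{d-1})$ be the set of functions $h : S^{d-1} \to \mathbb{R}$
such that for $B_{d-1} := \{x \in \mathbb{R}^{d-1} : |x| \le 1\}$, $h \circ \psi$ and $h \circ \eta \in \mathcal{F}_{\alpha,K}(B_{d-1})$,
recalling that $f \circ g(y) := f(g(y))$. Let $\mathcal{F}^{(d)}_{\alpha,K}(S^{d-1})$ be the set of functions
$h = (h_1, \dots, h_d)$ such that $h_j \in \mathcal{F}_{\alpha,K}(S^{d-1})$ for each $j = 1, \dots, d$.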

Two continuous functions $F, G$ from one topological space $X$ to another,
$Y$, are called homotopic iff there exists a jointly continuous function $H$ from
$X \times [0, 1]$ into $Y$ such that $H(\cdot, 0) = F$ and $H(\cdot, 1) = G$. $H$ is then called
a homotopy of $F$ and $G$. Let $I(F)$ be the set of all $y \in Y$, not in the range of
$F$, such that among mappings of $X$ into $Y \setminus \{y\}$, $F$ is not homotopic to any
constant map $G(x) \equiv z \ne y$.

For a function $F$ let $R(F) := \mathrm{ran}(F) := \mathrm{range}(F)$ and $C(F) := I(F) \cup R(F)$.

For example, if $F$ is the identity from $S^{d-1}$ onto itself in $\mathbb{R}^d$, then $I(F) =
\{y : |y| < 1\}$ by well-known facts in algebraic topology, e.g., Eilenberg and
Steenrod (1952, Chapter 11, Theorem 3.1).

Let $\mathcal{I}(d, \alpha, K) := \{I(F) : F \in \mathcal{F}^{(d)}_{\alpha,K}(S^{d-1})\}$ and $\mathcal{K}(d, \alpha, K) := \{C(F) :
F \in \mathcal{F}^{(d)}_{\alpha,K}(S^{d-1})\}$. Then $\mathcal{I}(d, \alpha, K)$ is a collection of open sets and $\mathcal{K}(d, \alpha, K)$
of compact sets, each of which, in a sense, has boundaries differentiable of
order $\alpha$. (For functions $F$ that are not one-to-one, the boundaries may not be
differentiable in some other senses.) For $\mathcal{K}(d, \alpha, K)$ and to some extent for
$\mathcal{I}(d, \alpha, K)$ there are bounds as for other classes of sets with $\alpha$ times differentiable
boundaries (Theorem 8.12):


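Theorem 8.13 For each $d = 2, 3, \dots$, $K \ge 1$ and $\alpha \ge 1$,

(a) there is a constant $H_{d,\alpha,K} < \infty$ such that for $0 < \varepsilon \le 1$, and the Hausdorff
metric $h$,

$$\log D(\varepsilon, \mathcal{K}(d, \alpha, K), h) \le H_{d,\alpha,K}/\varepsilon^{(d-1)/\alpha}.$$

(b) For any $\zeta < \infty$ there is a constant $A_{d,\alpha,K,\zeta} < \infty$ such that for
any law $P$ on $\mathbb{R}^d$ having density with respect to $\lambda^d$ bounded above by $\zeta$, for
$0 < \varepsilon \le 1$,

$$\max\bigl(\log N_I(\varepsilon, \mathcal{K}(d, \alpha, K), P),\ \log N_I(\varepsilon, \mathcal{I}(d, \alpha, K), P)\bigr) \le A_{d,\alpha,K,\zeta}/\varepsilon^{(d-1)/\alpha}.$$

The proof will follow from a sequence of lemmas.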
Lemma 8.14 If $H$ is a homotopy of $F$ and $G$, then $I(F) \,\Delta\, I(G) \subset R(H)$.


Proof. Suppose $y \in I(F) \setminus I(G)$ and $y$ is not in the range of $H$. Then $F$ and $G$
are homotopic as maps into $Y \setminus \{y\}$. Homotopy is clearly a transitive relation.
Since $G$ is homotopic to a constant map into $Y \setminus \{y\}$, so is $F$, a contradiction.
The Lemma is proved.


Lemma 8.15 If $F$ is a continuous map from a compact Hausdorff topological
space $K$ into a Hausdorff space $Y$, then $C(F)$ is closed.


Proof. If $y \notin C(F)$, then there is a homotopy $H$ of $F$ to a constant map
$G(x) \equiv z$ into $Y \setminus \{y\}$. Then $R(H)$ is compact, thus closed (RAP, Theorem
2.2.3 and Proposition 2.2.9). Clearly $I(G) = \emptyset$, so by the previous lemma,
$I(F) \subset R(H)$. Then the open complement $Y \setminus R(H) \subset Y \setminus C(F)$, so $Y \setminus C(F)$
is open and $C(F)$ is closed.


Lemma 8.16 If $F$ is a continuous map from a compact Hausdorff topological
space into some $\mathbb{R}^d$, then $I(F)$ is open, and its boundary is included in
$R(F)$.


Proof. Let $x \in I(F)$. Then for some $\delta > 0$, $d(x, y) < 2\delta$ implies $y \notin R(F)$.
Let $d(x, y) < \delta$. Then there is a homeomorphism $g$ of $\mathbb{R}^d$ which leaves $\{u :
|u - x| \ge 2\delta\}$, and so the values of $F$, fixed and takes $x$ to $y$. To define such
a $g$ we can assume $x = 0$ and $\delta = 1/2$, let $g(u) := u$ for $|u| \ge 1$ and
$g(u) := u + y(1 - |u|)$ for $|u| < 1$. Then $g$ is the identity for $|u| \ge 1$ and
is continuous, with $g(0) = y$. Also, $g$ is 1-1 since $|g(u)| < 1$ for $|u| < 1$,
and if $g(u) = g(v)$ with $|u|, |v| < 1$, then $u - v = y(|v| - |u|)$ and $|u - v| \le
\bigl||u| - |v|\bigr|/2 \le |u - v|/2$, so $u = v$. Thus $y \in I(F)$, so $I(F)$ is open. Since
$C(F)$ is closed by Lemma 8.15, it follows that the boundary of $I(F)$ is included
in $R(F)$.
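The map $g$ used in this proof is easy to experiment with. The sketch below, in dimension 2 with an arbitrarily chosen $y$ of norm less than $1/2$, checks numerically that $g(0) = y$, that $g$ fixes points with $|u| \ge 1$, and that $|g(u) - g(v)| \ge |u - v|/2$ on a random sample, which is the inequality behind injectivity. The sample size and the choice of $y$ are assumptions made only for illustration.

```python
import math
import random

def g(u, y):
    """g(u) = u + y(1 - |u|) for |u| < 1, and g(u) = u for |u| >= 1."""
    r = math.hypot(*u)
    if r >= 1.0:
        return tuple(u)
    return tuple(ui + yi * (1.0 - r) for ui, yi in zip(u, y))

y = (0.3, -0.2)                      # any y with |y| < 1/2 will do
random.seed(1)
pts = [(random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5)) for _ in range(300)]

# Smallest observed ratio |g(u) - g(v)| / |u - v| over all pairs in the sample.
worst = min(
    math.dist(g(u, y), g(v, y)) / math.dist(u, v)
    for i, u in enumerate(pts) for v in pts[i + 1:]
)
print(g((0.0, 0.0), y), worst >= 0.5)
```

The lower bound holds because $u \mapsto \max(0, 1 - |u|)$ is Lipschitz with constant 1, so $g$ moves distances by a factor of at least $1 - |y| > 1/2$.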


Recall that for a metric space $(S, d)$, set $A \subset S$ and $\delta > 0$, the $\delta$-interior
of $A$ is defined by ${}_\delta A := \{x : d(x, y) < \delta \text{ implies } y \in A\}$, and the $\delta$-neighborhood
by $A^\delta := \{y : d(x, y) < \delta \text{ for some } x \in A\}$.


Lemma 8.17 For continuous functions $F, G$ from $S^{d-1}$ into $\mathbb{R}^d$, if $d_{\sup}(F, G)
:= \sup\{|F(u) - G(u)| : u \in S^{d-1}\} < \delta$, then

$${}_\delta I(F) \subset I(G) \subset C(G) \subset C(F)^\delta.$$


Proof. If $x \in {}_\delta I(F)$ and $x \in R(G)$, then $d(x, y) < \delta$ for some $y \in R(F)$,
so $y \notin I(F)$, a contradiction. So $x \notin R(G)$. For $0 \le t \le 1$ and $u \in S^{d-1}$ let
$H(u, t) := (1 - t)F(u) + tG(u)$. Then $H$ is a homotopy of $F$ and $G$, and
$R(H) \subset R(F)^\delta$, but ${}_\delta I(F) \cap R(F)^\delta = \emptyset$, so $x \notin R(H)$. Thus by Lemma 8.14,
$x \in I(G)$.

Next, let $y \in C(G)$. If $y \in R(G)$ then $y \in R(F)^\delta \subset C(F)^\delta$. Otherwise $y \in
I(G)$. Then $y \in I(F)$ or by Lemma 8.14, $y \in R(H) \subset R(F)^\delta$.


Lemma 8.18 For any continuous function $F$ from a compact Hausdorff space
$K$ into $\mathbb{R}^d$, $C(F)^\delta \setminus {}_\delta I(F) \subset R(F)^\delta$.


Proof. Let $x \in C(F)^\delta \setminus {}_\delta I(F)$. Suppose $d(x, R(F)) \ge \delta$. Then $|x - y| < \delta$ for
some $y \in I(F)$. Since the boundary of $I(F)$ is included in $R(F)$ by Lemma 8.16,
the line segment $\{tx + (1 - t)y : 0 \le t \le 1\} \subset I(F)$. It follows that $x \in I(F)$
and then likewise that $z \in I(F)$ whenever $|x - z| < \delta$. Thus $x \in {}_\delta I(F)$, a
contradiction.

The Lipschitz seminorm $\|F\|_L$ is defined for functions with values in $\mathbb{R}^d$
just as for real-valued functions. Let $v_k$ be the Lebesgue volume of the unit ball
in $\mathbb{R}^k$.

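Lemma 8.19 For $k = 1, 2, \dots$, if $(T, d)$ is a metric space, $\delta > 0$, for some
$M < \infty$, $D(\delta, T, d) \le M\delta^{1-k}$, and $F$ is Lipschitz from $T$ into $\mathbb{R}^k$, with $\|F\|_L \le
\kappa$, then

$$\lambda^k\bigl(C(F)^\delta \setminus {}_\delta I(F)\bigr) \le v_k M (\kappa + 2)^k \delta.$$


Proof. For the usual metric $e$ on $\mathbb{R}^k$, we have $D(\kappa\delta, R(F), e) \le M\delta^{1-k}$.
It follows that $D((\kappa + 2)\delta, R(F)^\delta, e) \le M\delta^{1-k}$. Lemma 8.18 gives the
conclusion.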


Proof of Theorem 8.13. By the definitions, a function $F \in \mathcal{F}^{(d)}_{\alpha,K}(S^{d-1})$ is given
by a pair $F_{(1)}, F_{(2)}$ of functions $F_{(j)} := (F_{(j)1}, \dots, F_{(j)d})$ where each $F_{(j)i} \in
\mathcal{F}_{\alpha,K}(B_{d-1})$ and $B_j$ is the closed unit ball in $\mathbb{R}^j$. Since $\alpha \ge 1$, each $F_{(j)i}$ is
Lipschitz with $\|F_{(j)i}\|_L \le K$, so each $F_{(j)}$ is Lipschitz with $\|F_{(j)}\|_L \le dK$. Let
$T := T_1 \cup T_2$ be a union of two disjoint copies $T_i$ of $B_{d-1}$, with the Euclidean
metric $e$ on each and $e(x, y) := 2$ for $x \in T_i$, $y \in T_j$, $i \ne j$. Letting $G := F_{(j)}$
on $T_j$, $j = 1, 2$, gives a function $G := G_F$ on $T$ with

$$\|G\|_L \le \max_j \|F_{(j)}\|_L \le dK. \qquad (8.12)$$

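Let $0 < \varepsilon \le 1$. Then there are $D(\varepsilon, B_j, e)$ disjoint balls of radius $\varepsilon/2$,
included in a ball of radius $1 + (\varepsilon/2)$. It follows by volumes that

$$D(\varepsilon, B_j, e) \le [(2 + \varepsilon)/2]^j (2/\varepsilon)^j \le (3/\varepsilon)^j. \qquad (8.13)$$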
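The volume bound (8.13) can be compared with an explicit separated set. The sketch below, for $j = 2$, greedily selects lattice points of the closed unit disk whose mutual distances exceed $\varepsilon$; the greedy rule, the lattice spacing and the function names are illustrative assumptions, and the greedy count is only a lower estimate of $D(\varepsilon, B_2, e)$.

```python
import math

def greedy_separated(points, eps):
    """Greedily pick points whose mutual distances all exceed eps."""
    chosen = []
    for p in points:
        if all(math.dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen

def disk_candidates(n):
    """Lattice points of spacing 1/n lying in the closed unit disk."""
    pts = []
    for i in range(-n, n + 1):
        for k in range(-n, n + 1):
            x, y = i / n, k / n
            if x * x + y * y <= 1.0:
                pts.append((x, y))
    return pts

j = 2
for eps in (1.0, 0.5, 0.25):
    count = len(greedy_separated(disk_candidates(20), eps))
    volume_bound = ((2 + eps) / 2) ** j * (2 / eps) ** j
    print(eps, count, volume_bound, (3 / eps) ** j)
```

Any set found this way has mutual distances greater than $\varepsilon$, so its size cannot exceed the middle term of (8.13), which in turn is at most $(3/\varepsilon)^j$ when $\varepsilon \le 1$.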


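By Theorem 8.4 for the ball case, for any $K \ge 1$, $d \ge 2$ and $\alpha \ge 1$, there is a
$C = C(K, d, \alpha) < \infty$ such that for $0 < \delta \le 1$,

$$\log D(\delta, \mathcal{F}_{\alpha,K}(B_{d-1}), d_{\sup}) \le C/\delta^{(d-1)/\alpha}.$$

It follows from the definitions with $T := T_1 \cup T_2$ that

$$\log D(\delta, \mathcal{F}_{\alpha,K}(S^{d-1}), d_{\sup}) \le 2C/\delta^{(d-1)/\alpha}$$

and thus that for $0 < \delta \le 1$,

$$\log D(\delta, \mathcal{F}^{(d)}_{\alpha,K}(S^{d-1}), d_{\sup}) \le 2C(d/\delta)^{(d-1)/\alpha}.$$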


Given $0 < \delta \le 1$ take a set of functions $f_1, \dots, f_m \in \mathcal{F}^{(d)}_{\alpha,K}(S^{d-1})$ such that
for $i \ne j$, $d_{\sup}(f_i, f_j) > \delta/2$ and maximal $m$, where $m \le \exp(B_d/\delta^{(d-1)/\alpha})$ and
$B_d := 2C \cdot (2d)^{(d-1)/\alpha}$. The brackets $[{}_\delta I(f_j), C(f_j)^\delta]$ for $j = 1, \dots, m$ cover
$\mathcal{I}(d, \alpha, K)$ and $\mathcal{K}(d, \alpha, K)$ by Lemma 8.17 (some sets ${}_\delta I(f_j)$ may be empty).
If $d_{\sup}(g, f_j) \le \delta/2$, then by Lemma 8.17, $C(g) \subset C(f_j)^\delta$ and $C(f_j) \subset C(g)^\delta$,
so $h(C(g), C(f_j)) \le \delta$ and part (a) of Theorem 8.13 follows. Then Lemma
8.19 applies with $M = 2 \cdot 3^{d-1}$ by (8.13) and $\kappa = Kd$ by (8.12), and holds for
$P$ in place of $\lambda^d$ with an additional factor of $\zeta$. Theorem 8.13(b) then follows
with $\delta := \varepsilon/[2 v_d \zeta \cdot 3^{d-1}(Kd + 2)^d]$ and $A_{d,\alpha,K,\zeta} := B_d [2 v_d \zeta \cdot 3^{d-1}(Kd +
2)^d]^{(d-1)/\alpha}$.


Corollary 8.20 For any law $P$ on $\mathbb{R}^d$, $d \ge 2$, having bounded density with
respect to Lebesgue measure, and $K < \infty$,


(a) (Tze-Gong Sun) $\mathcal{I}(d, \alpha, K)$ is a Donsker class for $P$ if $\alpha > d - 1$.
(b) $\mathcal{I}(d, \alpha, K)$ is a Glivenko-Cantelli class for $P$ whenever $\alpha \ge 1$.


Proof. Apply Theorem 8.13(b) and, for part (a), Corollary 7.8; for part (b), the
Blum-DeHardt theorem 7.1.


8.3 Lower Layers 


A set $B \subset \mathbb{R}^d$ is called a lower layer if and only if for all $x = (x_1, \dots, x_d)
\in B$ and $y = (y_1, \dots, y_d)$ with $y_j \le x_j$ for $j = 1, \dots, d$, we have $y \in B$. Let
$\mathcal{LL}_d$ denote the collection of all nonempty lower layers in $\mathbb{R}^d$ with nonempty
complement. Recall that $\emptyset$ is the empty set and let

$$\mathcal{LL}_{d,1} := \{L \cap I^d : L \in \mathcal{LL}_d,\ L \cap I^d \ne \emptyset\}.$$

Let $\lambda := \lambda^d$ denote Lebesgue measure on $I^d$. Thus $d_\lambda$ is defined for any two
Lebesgue measurable sets in $\mathbb{R}^d$ by $d_\lambda(A, B) := \lambda^d((A \,\Delta\, B) \cap I^d)$. The size of
$\mathcal{LL}_{d,1}$ will be bounded first when $d = 1$ and 2. Let $\lceil x \rceil$ be the smallest integer
$\ge x$.


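Theorem 8.21 For $d = 1$,
$$D(\varepsilon, \mathcal{LL}_{1,1}, h) = D(\varepsilon, \mathcal{LL}_{1,1}, d_\lambda) = N_I(\varepsilon, \mathcal{LL}_{1,1}, \lambda) = \lceil 1/\varepsilon \rceil.$$
For $d = 2$ and any $m = 1, 2, \dots$, we have

$$N_I(2/m, \mathcal{LL}_{2,1}, \lambda) \le \binom{2m-2}{m-1} \le 2^{2m-2} = 4^{m-1}.$$

For $0 < \varepsilon \le 1$, $N_I(\varepsilon, \mathcal{LL}_{2,1}, \lambda) \le 4^{2/\varepsilon}$. Lastly, for $0 < s < 1/m$,

$$D(\sqrt{2}/m, \mathcal{LL}_{2,1}, h) \le \binom{2m}{m} - 1 \le D(s, \mathcal{LL}_{2,1}, h)$$

and $D(\varepsilon, \mathcal{LL}_{2,1}, h) < 4^{1+\sqrt{2}/\varepsilon}$.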


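Proof. For $d = 1$, sets in $\mathcal{LL}_{1,1}$ are intervals $[0, t)$, $0 < t \le 1$, or $[0, t]$, $0 \le t \le
1$. For any $\varepsilon$ with $0 < \varepsilon \le 1$, let $m := m(\varepsilon) := \lceil 1/\varepsilon \rceil$. Then the collection of
$m$ brackets $[[0, (k - 1)\varepsilon], [0, k\varepsilon]]$, $k = 1, \dots, m$, covers $\mathcal{LL}_{1,1}$ with minimal
$m$ for $\varepsilon$, showing that $N_I(\varepsilon, \mathcal{LL}_{1,1}, \lambda) = m$. For any $\delta > 0$, the points $k(\varepsilon + \delta)$
for $k = 0, 1, \dots, \lfloor 1/(\varepsilon + \delta) \rfloor$, are at distances at least $\varepsilon + \delta$ apart. For $0 \le x \le
y \le 1$, $h([0, x], [0, y]) = d_\lambda([0, x], [0, y]) = y - x$, so

$$D(\varepsilon, \mathcal{LL}_{1,1}, h) = D(\varepsilon, \mathcal{LL}_{1,1}, d_\lambda) = D(\varepsilon, [0, 1], d) =: D(\varepsilon)$$

for the usual metric $d$. Letting $\delta \downarrow 0$ gives $D(\varepsilon) = \lceil 1/\varepsilon \rceil = m$ (cf. Chapter
Problem 5), finishing the proof for $d = 1$.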

For $d = 2$, decompose the unit square $I^2$ into a union of $m^2$ squares $S_{ij} :=
[(i - 1)/m, i/m) \times [(j - 1)/m, j/m)$, $i, j = 1, \dots, m - 1$, but for $i = m$ or
$j = m$, replace "$i/m)$" or "$j/m)$" respectively by "$1]$." For any $L \in \mathcal{LL}_{2,1}$, let
${}_mL$ be the union of the squares in the grid included in $L$ and $L_m$ the union of
the squares which intersect $L$. Then ${}_mL \subset L \subset L_m$, and both ${}_mL$ and $L_m$ are
in $\mathcal{LL}_{2,1} \cup \{\emptyset\}$, with $L_m \ne \emptyset$.

For each $m$ and each function $f$ from $\{2, 3, \dots, 2m - 1\}$ into $\{0, 1\}$ taking
the value 1 exactly $m - 1$ times, define a sequence $S(f)(k)$, $k = 1, \dots, 2m - 1$,
of squares in the grid as follows. Let $S(f)(1)$ be the upper left square $S_{1m}$.
Given $S(f)(k - 1) = S_{ij}$, let $S(f)(k)$ be the square $S_{i+1,j}$ just to its right
if $f(k) = 1$, otherwise the square $S_{i,j-1}$ just below it, for $k = 2, \dots, 2m -
1$. Then $S(f)(2m - 1)$ is always the lower right square $S_{m1}$. Let $B_m(f) :=
\bigcup_{k=1}^{2m-1} S(f)(k)$. Let $A_m(f)$ be the union of the squares not in $B_m(f)$, below and
to the left of it, and $C_m(f) := A_m(f) \cup B_m(f)$. Here $A_m(f)$ and $C_m(f)$ belong
to $\mathcal{LL}_{2,1} \cup \{\emptyset\}$ with $C_m(f) \ne \emptyset$. Also, if $f \ne g$, then $h(C_m(f), C_m(g)) \ge 1/m$.
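The staircase construction can be enumerated directly for small $m$. The following sketch (function names are illustrative, not from the text) builds the sequence of grid squares for every admissible $f$ and confirms that there are $\binom{2m-2}{m-1}$ of them, each ending at the lower right square.

```python
from itertools import combinations
from math import comb

def staircase(m, ones):
    """Squares S(f)(1), ..., S(f)(2m-1) as (i, j) pairs, where f(k) = 1 exactly for k in `ones`."""
    i, j = 1, m                      # start at the upper left square S_{1m}
    path = [(i, j)]
    for k in range(2, 2 * m):
        if k in ones:
            i += 1                   # f(k) = 1: move to the square on the right
        else:
            j -= 1                   # f(k) = 0: move to the square below
        path.append((i, j))
    return path

def all_staircases(m):
    """One staircase for each f taking the value 1 exactly m - 1 times."""
    steps = range(2, 2 * m)
    return [staircase(m, set(c)) for c in combinations(steps, m - 1)]

m = 4
paths = all_staircases(m)
print(len(paths), comb(2 * m - 2, m - 1), 4 ** (m - 1))
```

Each path consists of $2m - 1$ squares of area $1/m^2$, which is where the bound $\lambda(B_m(f)) = (2m - 1)/m^2 < 2/m$ used later in the proof comes from.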

Let $L \in \mathcal{LL}_{2,1} \cup \{\emptyset\}$. Let $\bar{L}$ be its closure and

$$M := M_L := \bar{L} \cup \{(0, y) : 0 \le y \le 1\} \cup \{(x, 0) : 0 \le x \le 1\}.$$

Then $M \subset I^2$ is compact and is in $\mathcal{LL}_{2,1}$. The range of $x - y$ on $M$ (or on $I^2$)
is $[-1, 1]$. For each $t \in [-1, 1]$ there is a unique $(x, y) := (x(t), y(t)) \in M$
with $x - y = t$ such that $x + y$ is maximized. It follows from the lower layer
properties that $x(\cdot)$ is nondecreasing and $y(\cdot)$ is nonincreasing. Let $g(t) :=
(x(t), y(t))$ and $G := \{g(t) : -1 \le t \le 1\}$. For $-1 \le s \le t \le 1$ we have

$$x(t) = y(t) + t \le y(s) + t = x(s) + t - s,$$

so $x(\cdot)$ is a Lipschitz function with $\|x(\cdot)\|_L \le 1$. Likewise $\|y(\cdot)\|_L \le 1$. In
particular $x(\cdot)$ and $y(\cdot)$ are continuous, and the curve $G$ is connected. We have
$x(-1) = 0$, $y(-1) = 1$, $x(1) = 1$, and $y(1) = 0$. For $t$ near $-1$, $g(t) \in S_{1m}$.
If $G$ intersects a square $S_{ij}$ for $i < m$ or $j > 1$, then it next intersects, as $t$
increases, one of the squares $S_{i+1,j}$ or $S_{i,j-1}$. (It cannot go directly to $S_{i+1,j-1}$
since the upper left vertex of that square is in $S_{i+1,j}$.) For $t$ near 1, $g(t) \in
S_{m1}$. Thus for some $f$ as above, there is a sequence of squares $S(f)(1) =
S_{1m}, \dots, S(f)(2m - 1) = S_{m1}$ intersected by $G$, and no other squares $S_{ij}$ are.
It follows that


$$A_m(f) \subset {}_mL \subset L \subset L_m \subset C_m(f),$$

so that $L_m \setminus {}_mL \subset B_m(f)$, and $\lambda(B_m(f)) = (2m - 1)/m^2 < 2/m$. Since the
number of functions $f$ is $\binom{2m-2}{m-1} \le 4^{m-1}$, the first sentence of the Theorem for
$d = 2$ is proved. For $0 < \varepsilon \le 1$, let $m = \lceil 2/\varepsilon \rceil$. Then $2/m \le \varepsilon$ and $m - 1 < 2/\varepsilon$, and the
second sentence follows.

For the statements about the Hausdorff metric $h$, let $M \in \mathcal{LL}_2$ be such
that $L := M \cap I^2 \ne \emptyset$, let $\partial M$ be the boundary of $M$ and $\partial_M L := \partial M \cap I^2$.
Possibly $\partial_M L = \emptyset$, if $I^2$ is included in the interior of $M$. Otherwise $\partial_M L$ equals
the graph of a function $(x(t), y(t))$ defined for $a \le t \le b$ where $-1 \le a \le b \le
1$, with $x(a) = 0$ or $y(a) = 1$ or both, and $x(b) = 1$ or $y(b) = 0$ or both. Without
loss of generality, we can assume that $(-1/m, 1) \in \partial M$ and $(1, -1/m) \in \partial M$.

There is a largest $j = j(L)$ with $1 \le j \le m$ such that $S_{1j}$ intersects $L$, and
a largest $i = i(L)$ with $1 \le i \le m$ such that $S_{i1}$ intersects $L$. Given $j$ and $i$, for
each function $f$ from $\{2, \dots, i + j - 1\}$ into $\{0, 1\}$ taking the value 1 exactly
$i - 1$ times, we define $B_m(f)$, $A_m(f)$ and $C_m(f)$ as before, replacing $A_m(f)$ if it
is empty by $\{(0, 0)\}$. Then for each $L$ there is an $f$ such that $A_m(f) \subset L \subset C_m(f)$.
If $L$ and $L'$ have the same $f$, then $h(L, L') \le \sqrt{2}/m$ and $h(L, C_m(f)) \le
\sqrt{2}/m$. The total number of possible functions $f$ is

$$\sum_{i=1}^{m} \sum_{j=1}^{m} \binom{i+j-2}{i-1} = \binom{2m}{m} - 1.$$

To see this, consider an $(m + 1) \times (m + 1)$ grid of squares in $[-1/m, 1] \times
[-1/m, 1]$, giving $\binom{2m}{m}$ possible $C_{m+1}(f)$, all but one of which intersect $S_{1,1}$.
Or, to see the equality of the first and last expressions in the display, consider
the numbers of strings of $m$ 0's and $m$ 1's, beginning with $m - i$ 0's and ending
with $m - j$ 1's, which summed over $i$ and $j$ from 1 to $m$ give all such strings
except the one with first $m$ 0's, then $m$ 1's. It follows that $D(\sqrt{2}/m, \mathcal{LL}_{2,1}, h) \le
\binom{2m}{m} - 1$. Conversely, each set $C_m(f) \ne \emptyset$ is in $\mathcal{LL}_{2,1}$, and two distinct such sets
are at distance at least $1/m$ apart for $h$, so the right-hand inequality in the
last display of the theorem follows. For $0 < \varepsilon \le 1$ let $m := \lceil \sqrt{2}/\varepsilon \rceil$. Then
$\sqrt{2}/m \le \varepsilon$ and $\binom{2m}{m} - 1 < 4^m \le 4^{1+\sqrt{2}/\varepsilon}$, proving the last statement.
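The counting identity used here can be checked by brute force. The sketch below reads the total number of functions $f$ as the double sum of $\binom{i+j-2}{i-1}$ over $1 \le i, j \le m$ (the number of $f$ on $\{2, \dots, i+j-1\}$ with exactly $i - 1$ ones); that reading of the display is an assumption, and `count_functions` is an illustrative name.

```python
from math import comb

def count_functions(m):
    """Sum over 1 <= i, j <= m of C(i + j - 2, i - 1)."""
    return sum(comb(i + j - 2, i - 1)
               for i in range(1, m + 1) for j in range(1, m + 1))

for m in range(1, 8):
    # double sum, closed form C(2m, m) - 1, and the cruder bound 4^m
    print(m, count_functions(m), comb(2 * m, m) - 1, 4 ** m)
```

For $m = 3$, for instance, both the double sum and $\binom{6}{3} - 1$ equal 19.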


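Recalling again the definition (8.4) of $\asymp$, for dimension $\ge 2$ we then have:

Theorem 8.22 For each $d \ge 2$, as $\varepsilon \downarrow 0$,

$$\log D(\varepsilon, \mathcal{LL}_{d,1}, h) \asymp \log D(\varepsilon, \mathcal{LL}_d, d_\lambda) = \log D(\varepsilon, \mathcal{LL}_{d,1}, d_\lambda) \asymp \log N_I(\varepsilon, \mathcal{LL}_{d,1}, \lambda) \asymp \varepsilon^{1-d}.$$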


Proof. First, for the Hausdorff metric $h$, it will be shown that for some constants
$c_d$ with $1 \le c_d < \infty$,

$$\log D(\varepsilon, \mathcal{LL}_{d,1}, h) \le c_d \varepsilon^{1-d} \qquad (8.14)$$

for $0 < \varepsilon \le 1$. For $d = 2$ this holds by the previous theorem. It will be proved
for $d \ge 2$ by induction on $d$. Suppose it holds for $d - 1$, for $d \ge 3$. Given
$0 < \varepsilon \le 1$, take a maximal number of sets $L_1, \dots, L_m \in \mathcal{LL}_{d-1,1}$ such that
$h(L_i, L_j) > \varepsilon/4$ for $i \ne j$, where $m \le \exp(c_{d-1}(4/\varepsilon)^{d-2})$. Let $k := \lceil 3/\varepsilon \rceil$
and $A \in \mathcal{LL}_{d,1}$. For $j = 0, 1, \dots, k$ let $A_j := \{x \in I^{d-1} : (x, j/k) \in A\}$
and $A_{(j)} := A_j \times \{j/k\} \subset A$. Then $A_j = \emptyset$ or $A_j \in \mathcal{LL}_{d-1,1}$. In the latter
case we can choose $i := i(j, A)$ such that $h(A_j, L_i) \le \varepsilon/4$. Let $L_0 := \emptyset$
and $i := i(j, A) := 0$ if $A_j = \emptyset$, so $h(A_j, L_i) = 0 \le \varepsilon/4$ in that case
also.

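Let $A, B \in \mathcal{LL}_{d,1}$ and suppose that $i(j, A) = i(j, B)$ for $j = 0, 1, \dots, k - 1$.
It will be shown that $h(A, B) \le \varepsilon$. Let $x \in A$. There is a $j = 0, 1, \dots, k - 1$
such that $j/k \le x_d \le (j + 1)/k$. Let $y := (x_1, \dots, x_{d-1}, j/k) \in A_{(j)}$. Then
$A_j \ne \emptyset$ and $h(A_j, B_j) \le \varepsilon/2$, so for some $z \in B_{(j)} \subset B$, we have $|y - z| \le
2\varepsilon/3$ and $|x - z| \le k^{-1} + 2\varepsilon/3 \le \varepsilon$. So $d(x, B) \le \varepsilon$, and by symmetry,
$h(A, B) \le \varepsilon$. Thus

$$D(\varepsilon, \mathcal{LL}_{d,1}, h) \le (m + 1)^k \le \bigl[\exp\bigl(2c_{d-1}(4/\varepsilon)^{d-2}\bigr)\bigr]^{4/\varepsilon} \le \exp\bigl(c_d/\varepsilon^{d-1}\bigr)$$

for $c_d := 2 \cdot 4^{d-1} c_{d-1}$, so (8.14) is proved.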
For the metrics in terms of $\lambda$ we have the following:


Lemma 8.23 Let $\delta > 0$ and let $A, B \in \mathcal{LL}_{d,1}$ with $h(A, B) \le \delta$. Then
$\lambda^d(A \,\Delta\, B) \le d^{d/2}\delta$.


Proof. Let $U$ be a rotation of $\mathbb{R}^d$ which takes $v := (1, 1, \dots, 1)$ into $(0, 0, \dots,
d^{1/2})$. Let $\pi_d(y) := (y_1, \dots, y_{d-1}, 0)$. Let $C$ be the cube $C := U[I^d] :=
\{U(x) : x \in I^d\}$. Each point of $I^d$ or $C$ is within $d^{1/2}/2$ of its respective
center. Thus each point $z \in H := \pi_d[C]$ is within $d^{1/2}/2$ of 0. Also, for any
$z \in \mathbb{R}^{d-1}$, $\{t \in \mathbb{R} : (z, t) \in C\}$ is empty or a closed interval $h(z) \le t \le j(z)$.
Let $C_z := \{(w, t) \in C : w = z\}$, a line segment. The intersections of $U[A]$
and $U[B]$ with $C_z$ are each either empty or line segments with the same lower
endpoint $(z, h(z))$ as $C_z$, so the two sets are linearly ordered by inclusion. Thus
the intersection of $U[A] \,\Delta\, U[B] = U[A \,\Delta\, B]$ with $C_z$ is some line segment
$S_{A,B,z}$. It will be shown that $S_{A,B,z}$ has length $\le d^{1/2}\delta$. Suppose not. Then by
symmetry we can assume that there is some $\zeta > \delta$ and a point $x \in B \setminus A$ such
that $v := x + \zeta(1, 1, \dots, 1) \in B$. The orthant $O := \{y : y_j > x_j \text{ for all }
j = 1, \dots, d\}$ is disjoint from $A$. But, the open ball of radius $\zeta$ and center $v$
is included in $O$, contradicting $h(A, B) \le \delta$. Now, $H$ is included in a cube of
side $d^{1/2}$ in $\mathbb{R}^{d-1}$ with center at 0, so by the Tonelli-Fubini theorem

$$\lambda^d(A \,\Delta\, B) \le (\delta d^{1/2})(d^{1/2})^{d-1} \le \delta d^{d/2},$$


proving the Lemma.

Returning to the proof of Theorem 8.22, from the last Lemma and (8.14) it
follows that for each $d = 2, 3, \dots$ and some $C_d < \infty$, for $0 < \varepsilon \le 1$,

$$\log D(\varepsilon, \mathcal{LL}_d, d_\lambda) = \log D(\varepsilon, \mathcal{LL}_{d,1}, d_\lambda) \le C_d \varepsilon^{1-d}.$$

Next, consider the remaining upper bound statement, for $N_I$. The angle between
$v := (1, 1, \dots, 1)$ and each hyperplane $x_j = 0$ is

$$\cos^{-1}\bigl(((d - 1)/d)^{1/2}\bigr) = \sin^{-1}\bigl(d^{-1/2}\bigr) = \tan^{-1}\bigl((d - 1)^{-1/2}\bigr).$$
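The three expressions for this angle are easy to confirm numerically. The sketch below also computes the angle directly, as the angle between $v = (1, \dots, 1)$ and its orthogonal projection onto the hyperplane $x_1 = 0$; the function names are illustrative.

```python
import math

def hyperplane_angle_forms(d):
    """The three closed forms for the angle between (1,...,1) and a hyperplane x_j = 0."""
    return (
        math.acos(math.sqrt((d - 1) / d)),
        math.asin(d ** -0.5),
        math.atan((d - 1) ** -0.5),
    )

def hyperplane_angle_direct(d):
    """Angle between v = (1,...,1) and its projection (0,1,...,1) onto {x_1 = 0}."""
    v = [1.0] * d
    proj = [0.0] + [1.0] * (d - 1)
    dot = sum(a * b for a, b in zip(v, proj))
    return math.acos(dot / (math.hypot(*v) * math.hypot(*proj)))

for d in (2, 3, 10):
    print(d, hyperplane_angle_forms(d), hyperplane_angle_direct(d))
```

For $d = 2$ all four values equal $\pi/4$. The complementary angle, between $v$ and a coordinate axis, is $\pi/2$ minus this one.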


Thus for any nonempty lower layer $B \ne \mathbb{R}^d$ and point $p$ on its boundary, $U[B]$
includes the cone

$$\{x : |x_{(d)} - q_{(d)}| < (q_d - x_d)(d - 1)^{-1/2}\},$$

where $q := U(p)$ and recalling that $x_{(d)} = (x_1, \dots, x_{d-1})$. Hence the boundary
of $U[B]$ is the graph of a function $f : \mathbb{R}^{d-1} \to \mathbb{R}$ where for any $s, t \in \mathbb{R}^{d-1}$,

$$f(s) \ge f(t) - K|s - t|, \qquad K := (d - 1)^{1/2}.$$

Hence, interchanging $s$ and $t$, $|f(s) - f(t)| \le K|s - t|$. So $\|f\|_L \le K$. Let
$J(f) := \{x : -\infty < x_d \le f(x_{(d)})\}$. Thus for each $B \in \mathcal{LL}_{d,1}$ we have
$U[B] = J(f) \cap U[I^d]$ for a function $f = f_B$ on $\mathbb{R}^{d-1}$ with $\|f\|_L \le K$. We
can restrict the functions $f$ to a cube $T$ of side $d^{1/2}$ centered at the origin in $\mathbb{R}^{d-1}$
parallel to the axes, which includes the projection of $U[I^d]$. We can also assume
that $\|f_B\|_{\sup} \le d^{1/2}$ for each $B$, since replacing $f$ by $\max(-d^{1/2}, \min(f, d^{1/2}))$
does not change $J(f) \cap U[I^d]$, nor does it increase $\|f\|_L$ (RAP, Proposition
11.2.2(a), since $\|g\|_L = 0$ if $g$ is constant). Now, apply Theorem 8.4 for $\alpha = 1$
and $d - 1$ in place of $d$, where by a fixed affine transformation we have a
correspondence between $I^{d-1}$ and the cube $T$. In this case, $I^{d-1}$ means the set of
points $(x_1, \dots, x_{d-1}, 0)$ of $\mathbb{R}^d$ such that $0 \le x_j \le 1$ for $j = 1, \dots, d - 1$.

Since $f \le g$ implies $J(f) \subset J(g)$ and $J(f) \cap U[I^d] \subset J(g) \cap U[I^d]$,
the bracketing parts of Theorem 8.4 imply the desired upper bound for
$\log N_I(\varepsilon, \mathcal{LL}_{d,1}, \lambda)$ with $-(d - 1)/\alpha = 1 - d$. This finishes the proof for upper
bounds.

Now for lower bounds, it will be enough to prove them for $D(\varepsilon, \mathcal{LL}_{d,1}, d_\lambda)$
in light of Lemma 8.23 and since $N_I(\varepsilon, \dots) \ge D(\varepsilon, \dots)$. The angle between
$v = (1, 1, \dots, 1)$ and each coordinate axis is

$$\theta_d := \cos^{-1}\bigl(d^{-1/2}\bigr) = \sin^{-1}\bigl(((d - 1)/d)^{1/2}\bigr) = \tan^{-1}\bigl((d - 1)^{1/2}\bigr).$$

Thus if $f : \mathbb{R}^{d-1} \to \mathbb{R}$ satisfies $\|f\|_L \le (d - 1)^{-1/2}$, then $L := U^{-1}(J(f))$
is a lower layer: if not, then for some $x \in L$ and $y \notin L$, $x_i = y_i$ for all $i$ except
that $x_j > y_j$ for some $j$. Then $U$ transforms the line through $x, y$ to a line $\ell$
forming an angle $\theta_d$ with the $d$th coordinate axis. Writing $\ell$ as $t_d = h(t_{(d)})$ we
have $\|h\|_L = \cot\theta_d = (d - 1)^{-1/2}$, which yields a contradiction.

Recall (Section 8.2) that for $\delta > 0$ and a metric space $Q$,

$$\mathcal{F}_{1,\delta}(Q) := \{f : Q \to \mathbb{R},\ \max(\|f\|_L, \|f\|_{\sup}) \le \delta\}.$$

Let $\delta := (d - 1)^{-1/2}d^{-1}$. Let $Q$ be a small enough cube in $\mathbb{R}^{d-1}$ with center
at 0. Then for each f € Fi,5(Q), we have $v +UI) C I’. For such ts 
If) < Su + U~'(J(f)) is an isometry for h and for d, and preserves inclu- 
sion. Each f can be extended to R¢!, preserving || f||, < 5 (RAP, Theorem 
6.1.1). So the lower bound with $d_{\sup}$ in Theorem 8.4 gives, via Lemma 8.1,
the lower bound with $h$ in Theorem 8.22. Theorem 8.7, likewise adapted from
$I^{d-1}$ to $Q$, gives the lower bounds with $\lambda$, proving Theorem 8.22.


Corollary 8.24 For any law $P$ on $\mathbb{R}^d$ having a bounded density with respect
to Lebesgue measure, $\mathcal{LL}_{d,1}$ is a Glivenko-Cantelli class.


Proof. This follows from the statements about N; in Theorem 8.22 (and in 
the degenerate case d = 1, in Theorem 8.21) and the Blum—DeHardt theo- 
rem 7.1. 


For what $d$ does the Donsker property hold under the same hypotheses? For
$d = 1$, it does easily. For $d \ge 2$, the hypothesis of Corollary 7.8, because of
the $x^2$ in it, does not follow from Theorem 8.22. In fact, the Donsker property
fails for $d = 2$ as will be shown in Theorem 11.10.
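The role of the $x^2$ can be made concrete with a small calculation. The sketch below assumes, since Corollary 7.8 is not restated in this section, that its hypothesis has the bracketing-integral form $\int_0^1 (\log N_I(x^2, \mathcal{C}, P))^{1/2}\,dx < \infty$. If $\log N_I(\varepsilon)$ is of order $\varepsilon^{-\gamma}$, the integrand is of order $x^{-\gamma}$, which is integrable near 0 exactly when $\gamma < 1$. All function names are illustrative.

```python
def bracketing_integral_converges(gamma):
    """With log N_I(eps) of order eps**(-gamma), sqrt(log N_I(x**2)) is of order
    x**(-gamma), which is integrable on (0, 1] if and only if gamma < 1."""
    return gamma < 1

def lower_layer_exponent(d):
    """Theorem 8.22 (d >= 2): log N_I(eps) is of order eps**(1 - d), so gamma = d - 1."""
    return d - 1

def smooth_boundary_exponent(d, alpha):
    """Theorem 8.13(b): log N_I(eps) is at most of order eps**(-(d - 1)/alpha)."""
    return (d - 1) / alpha

for d in (2, 3, 4):
    print("lower layers, d =", d,
          bracketing_integral_converges(lower_layer_exponent(d)))
print("I(d, alpha, K), d = 3, alpha = 2.5:",
      bracketing_integral_converges(smooth_boundary_exponent(3, 2.5)))
```

Under this reading, lower layers give $\gamma = d - 1 \ge 1$ for every $d \ge 2$, so the integral diverges, while for $\mathcal{I}(d, \alpha, K)$ the condition $\gamma < 1$ is the requirement $\alpha > d - 1$ of Corollary 8.20(a).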


8.4 Metric Entropy of Classes of Convex Sets 


A $C^2$ function $f$ on $\mathbb{R}^d$ is convex if and only if its Hessian matrix $\{\partial^2 f/\partial x_i \partial x_j\}$ is
everywhere nonnegative definite. For a general convex function, these derivatives
need not exist everywhere, although the Hessian in the generalized sense of
Schwartz distributions exists as a nonnegative definite matrix-valued measure
(Bakel'man 1965, Reshetnyak 1968; cf. Dudley 1980). Thus convex functions
are comparable to functions differentiable just of order 2, not necessarily for
any $\alpha > 2$. It will be seen that metric entropy or capacity of convex subsets of
a given bounded open subset of $\mathbb{R}^d$ for $d \ge 2$ is of the same order as that of
subsets with boundaries given by twice differentiable functions.

Let $\mathcal{C}_d$ denote the class of all nonempty closed convex subsets of the open
unit ball $B(0, 1) := \{x : |x| < 1\}$ in $\mathbb{R}^d$. Let $\lambda$ be the uniform Lebesgue measure
on $\mathbb{R}^d$. Upper and lower bounds will be given for the metric entropy of $\mathcal{C}_d$ for
the metric $d_\lambda$ and for the Hausdorff metric $h$.


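Theorem 8.25 (E. M. Bronštein) For each $d \ge 2$ we have
$$\log D(\varepsilon, \mathcal{C}_d, d_\lambda) \asymp \log D(\varepsilon, \mathcal{C}_d, h) \asymp \varepsilon^{(1-d)/2} \quad \text{as } \varepsilon \downarrow 0.$$

Remark. For $d = 1$, $\mathcal{C}_1$ is just the class of subintervals of the open interval
$(-1, 1)$. Then it is rather easy to see that

$$D(\varepsilon, \mathcal{C}_1, d_\lambda) \asymp D(\varepsilon, \mathcal{C}_1, h) \asymp \varepsilon^{-2}.$$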
Before proving the theorem, let us prove from it: 


Corollary 8.26 In $\mathbb{R}^d$, for $d \ge 2$, if $P$ is a law whose restriction to $B(0, 1)$ has
a bounded density $f$ with respect to Lebesgue measure, then


(a) $\log N_I(\varepsilon, \mathcal{C}_d, P) = O(\varepsilon^{(1-d)/2})$ as $\varepsilon \downarrow 0$.

(b) (E. Bolthausen) For $d = 2$, $\mathcal{C}_2$ is a Donsker class for $P$.
(c) For any $d$, $\mathcal{C}_d$ is a Glivenko-Cantelli class.

(d) If also $f \ge v$ on $B(0, 1)$ for some constant $v > 0$, then

$$\log N_I(\varepsilon, \mathcal{C}_d, P) \asymp \varepsilon^{(1-d)/2} \quad \text{as } \varepsilon \downarrow 0.$$


Note. For $d = 3$, the class $\mathcal{C}_3$ of convex sets is not a Donsker class for, e.g., the
uniform distribution on the unit cube, as will be shown in Theorem 11.10.


Proof. Let $0 \le f(x) \le V < \infty$ for all $x$. For $B \subset \mathbb{R}^d$ and $\delta > 0$ recall that
${}_\delta B := \{x : y \in B \text{ whenever } |x - y| < \delta\}$ and $B^\delta := \{x : d(x, B) < \delta\}$. For a
closed set $B$, we also have ${}_\delta B = \{x : y \in B \text{ whenever } |x - y| \le \delta\}$. Then $B^\delta$
is always open and ${}_\delta B$ is always closed. We have ${}_\delta B \subset B \subset B^\delta$. For any two
sets $B, C$, if $h(B, C) < \delta$, then $C \subset B^\delta$. It will be shown next that if $B$ and $C$
are closed and convex, then also ${}_\delta B \subset C$: if not, let $x \in {}_\delta B \setminus C$. Take a closed
half-space $J \supset C$ with $x \notin J$ (RAP, Theorem 6.2.9). On the line through $x$
perpendicular to the hyperplane bounding $J$, on the side opposite $J$, there are
points $y \in B \setminus C^\delta$, a contradiction. So ${}_\delta B \subset C \subset B^\delta$. For $d_\lambda$, the following
will help. For any set $B$, let $\partial B$ denote the boundary of $B$.


Lemma 8.27 For any $d = 1, 2, \dots$, there is a $K = K(d) < \infty$ such that for
any $B \in \mathcal{C}_d$ and $0 < \delta \le 1$, $\lambda(B^\delta \setminus {}_\delta B) \le K\delta$.


Proof. If $B$ is convex and has empty interior, then it is included in some
hyperplane (RAP, Theorem 6.2.6). Then $\lambda(B^\delta \setminus {}_\delta B) = \lambda(B^\delta) \le K_1\delta$ where
$K_1$ is twice the $(d - 1)$-dimensional volume of a ball of radius 2 in $\mathbb{R}^{d-1}$. For a
convex set $B$ and $\varepsilon > 0$, the set $B^\varepsilon$ is the vector sum $B + \varepsilon B(0, 1)$. The volume
$\lambda(B^\varepsilon)$ can be written as a polynomial in $\varepsilon$,

$$\lambda(B^\varepsilon) = \lambda(B) + C_1(B)\varepsilon + C_2(B)\varepsilon^2 + \cdots + C_d\varepsilon^d$$

where the coefficients $C_j$ are known as mixed volumes of $B$ and $B(0, 1)$ (e.g.,
Eggleston 1958, pp. 82-89; Bonnesen and Fenchel 1934, pp. 38, 46-47). Here
$C_1(B) = \lim_{\varepsilon \downarrow 0}(\lambda(B^\varepsilon) - \lambda(B))/\varepsilon$ is the $(d - 1)$-dimensional surface area of
$\partial B$ (Eggleston 1958, p. 88), and $C_d = \lambda(B(0, 1))$.

All the mixed volumes $C_j(B)$ are nondecreasing functions of the convex
set $B$ (Eggleston 1958, Theorem 42 p. 86; Bonnesen and Fenchel 1934,
p. 41). Thus, all the $C_j(B)$ are maximized for $B \subset B(0, 1)$ when $B = B(0, 1)$. It
follows that the derivative $d\lambda(B^\varepsilon)/d\varepsilon$ is bounded above uniformly for $B \in \mathcal{C}_d$
and $0 < \varepsilon \le 1$ by a constant $K_2 = K_2(d)$, so that $\lambda(B^\varepsilon \setminus B) \le K_2\varepsilon$ for all
$B \in \mathcal{C}_d$ and $0 < \varepsilon \le 1$.

Now suppose $B$ has an interior. Then $B^\delta \setminus {}_\delta B = (\partial B)^\delta$. For a set of points
on $\partial B$ at distance more than $\delta$ apart, the balls of radius $\delta/2$ with centers at the
points are disjoint, and the outer halves of these balls cut by support hyperplanes
to $B$ at the points are outside of $B$. Thus for $0 < \delta \le 1$

$$(\delta/2)K_2 \ge \lambda(B^{\delta/2} \setminus B) \ge \tfrac{1}{2}C_d(\delta/2)^d D(\delta, \partial B, \rho)$$

where $\rho$ is the Euclidean distance. Thus

$$D(\delta, \partial B, \rho) \le 2^d K_2 \delta^{1-d}/C_d.$$

Then for $0 < \delta \le 1$

$$\lambda((\partial B)^\delta) \le D(\delta, \partial B, \rho)\,C_d(2\delta)^d \le 4^d K_2 \delta,$$

and the Lemma follows.
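The polynomial formula for $\lambda(B^\varepsilon)$ used in this proof can be seen in the simplest planar case. The sketch below takes the unit square in $\mathbb{R}^2$, which is not a member of $\mathcal{C}_2$ but is convex, so the same expansion applies: area 1, perimeter 4 and $\lambda(B(0,1)) = \pi$ give $\lambda(B^\varepsilon) = 1 + 4\varepsilon + \pi\varepsilon^2$. The midpoint-rule grid and its resolution are illustrative assumptions.

```python
import math

def dist_to_unit_square(x, y):
    """Euclidean distance from (x, y) to the closed square [0, 1]^2."""
    dx = max(0.0 - x, 0.0, x - 1.0)
    dy = max(0.0 - y, 0.0, y - 1.0)
    return math.hypot(dx, dy)

def neighborhood_area(eps, n=400):
    """Midpoint-rule estimate of the area of {p : dist(p, square) < eps}."""
    lo, hi = -eps, 1.0 + eps
    h = (hi - lo) / n
    inside = 0
    for i in range(n):
        x = lo + (i + 0.5) * h
        for k in range(n):
            y = lo + (k + 0.5) * h
            if dist_to_unit_square(x, y) < eps:
                inside += 1
    return inside * h * h

eps = 0.5
steiner = 1.0 + 4.0 * eps + math.pi * eps ** 2   # area + perimeter*eps + pi*eps^2
print(neighborhood_area(eps), steiner)
```

The two printed numbers agree up to the discretization error of the grid, which comes only from the four rounded corners.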


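Now continuing the proof of Corollary 8.26, given $0 < \delta \le 1$, let $\mathcal{N}_d(\delta)$
be a $\delta$-net in $\mathcal{C}_d$ for $h$ with cardinality at most $\exp(A_d\delta^{(1-d)/2})$ where $A_d$ is a
large enough constant. The brackets $[{}_\delta B, B^\delta]$ for $B \in \mathcal{N}_d(\delta)$ cover $\mathcal{C}_d$: in other
words, for any $C \in \mathcal{C}_d$ there is such a $B$ with ${}_\delta B \subset C \subset B^\delta$, as seen just before
Lemma 8.27, and $\lambda(B^\delta \setminus {}_\delta B) \le K\delta$, so $P(B^\delta \setminus {}_\delta B) \le KV\delta$, and

$$N_I(\varepsilon, \mathcal{C}_d, P) \le D(\varepsilon/(KV), \mathcal{C}_d, h) \le \exp\bigl(A_d(\varepsilon/(KV))^{(1-d)/2}\bigr)$$

for $\varepsilon$ small enough, proving part (a).

Given part (a), part (b) follows from Corollary 7.8 and part (c) from the
Blum-DeHardt theorem 7.1.

For part (d), we have

$$N_I(\varepsilon, \mathcal{C}_d, P) \ge D(\varepsilon, \mathcal{C}_d, d_P) \ge D(\varepsilon/v, \mathcal{C}_d, d_\lambda) \ge \exp\bigl(c_d(\varepsilon/v)^{(1-d)/2}\bigr)$$

for some $c_d > 0$ and all $\varepsilon$ small enough, which finishes the proof of Corollary
8.26 from Theorem 8.25.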


Now Theorem 8.25 will be proved. Let $0 < \varepsilon \le 1$. For any set $C \subset \mathbb{R}^d$ and
$r > 0$ let $C^{r]} := \{x \in \mathbb{R}^d : d(x, C) \le r\}$. Then the open set $C^r$ is included in
the closed set $C^{r]}$, $\partial C^r = \partial C^{r]}$, a closed set, and $h(C^r, C^{r]}) = 0$.


Lemma 8.28 For any $C, D \in \mathcal{C}_d$ and $r > 0$, $h(C^{r]}, D^{r]}) = h(C, D)$, in other
words $\varphi_r : E \mapsto E^{r]}$ is an isometry for $h$.

Proof. Let $s > 0$. It will be shown that $D \subset C^s$ if and only if $D^{r]} \subset (C^{r]})^s =
C^{r+s}$. "Only if" is straightforward. To prove "if," suppose not. Let $a \in D \setminus C^s$.
There is a unique point $q$ of $C^{s]}$ closest to $a$: there is a nearest point $q$ since
$C^{s]}$ is compact, and if $b$ is another nearest point, then $(q + b)/2 \in C^{s]}$ since $C$
is convex and $(q + b)/2$ is nearer to $a$, a contradiction. (Possibly $q = a$.) Now
$q \in \partial(C^{s]})$. If $q = a$, take a support hyperplane $H$ to $C^{s]}$ at $q$ (RAP, Theorem
6.2.7). If $q \ne a$, then the hyperplane $H$ through $q$ perpendicular to the line
segment $aq$ is a support hyperplane to $C^{s]}$ at $q$ (if there were a point $c$ of $C^{s]}$ on
the same side of $H$ as $a$, then on the line segment $cq$ there would be a point of
$C^{s]}$ closer to $a$ than $q$ is, a contradiction). Let $p$ be a point at distance $r$ from $a$
in the direction perpendicular to $H$ and heading away from $C^{s]}$. Then $p \in D^{r]}$,
but $p \notin (C^{r]})^s = C^{r+s}$, a contradiction. So "if" is proved. Since $C$ and $D$ can
be interchanged, the Lemma follows.


For $r > 0$, $\varphi_r$ is a useful smoothing, as it takes a convex set $D$, which
may have a sharply curved boundary (vertices, edges, etc.) to a convex set $D^{r]}$
whose boundary is no more curved than a sphere of radius $r$, and so will be
easier to approximate.

Now, for a given C € Cg and e > 0 let 


NAC) := {D € Ca : h(C, D) < 8}, 
N.(C) := {D € NAC): C c D}. 


For any convex C, ¢24¢ is an isometry from \V/,(C) into No(C 2), 

A sequence of lemmas will be proved. Here $|\cdot|$ will denote the usual
Euclidean norm on $\mathbb{R}^d$. Let $\mathcal{C}_{d,1,3}$ denote the class of all closed convex sets
$C$ in $\mathbb{R}^d$ such that $B(0, 1) \subset C \subset B(0, 3)$. Note that if $C \in \mathcal{C}_d$, then clearly
$C^{2]} \in \mathcal{C}_{d,1,3}$.


Lemma 8.29 Let $E \in \mathcal{C}_{d,1,3}$. Let $x \in \partial E$ and let $H$ be a support hyperplane
to $E$ at $x$. Let $p$ be the point of $H$ closest to 0. Then $|p| \ge 1$ and $\angle 0xp \ge
\sin^{-1}(1/3)$.


Proof. Existence of support hyperplanes is proved in RAP (Theorem 6.2.7).
Clearly $H$ is disjoint from $B(0, 1)$, so $|p| \ge 1$. Since $|x| \le 3$ and $\angle 0px = \pi/2$,
the Lemma follows.


Lemma 8.30 If $E \in \mathcal{C}_{d,1,3}$, $r > 0$, and $z$ is a point such that $z \in E^{r]} \setminus E$, let
$x(z)$ be the point on the half-line from 0 to $z$ and in $\partial E$. Then $|z - x(z)| \le 3r$.


Proof. Note that $x = x(z)$ is uniquely determined since $E \supset B(0, 1)$. Apply
Lemma 8.29. Let $z_1$ be the point of $H$ closest to $z$. Then $|z - z_1| \le r$. The
vectors $z - z_1$ and $p$ are parallel, so $\theta := \angle p0z = \angle 0zz_1$, and $|z - x| = |z -
z_1||x|/|p| \le 3r$.

Lemma 8.31 Suppose $C \in \mathcal{C}_d$, $y \in \partial C$, $x \in \partial C^{2]}$, and $|x - y| = 2$. Then for
any two-dimensional subspace $V$ containing $x$, $V \cap B(y, 2)$ is a disk containing
0 of radius at least $2/3$.


Proof. Clearly $0 \in B(y, 2) \subset C^{2]}$. Apply Lemma 8.29 again. Then $H$ is also
a support hyperplane to $B(y, 2)$ at $x$, so $x - y$ is orthogonal to $H$ and in the
same direction as $p$. Let $q$ be the point on the segment $[0, x]$ closest to $y$.
Then $\theta := \angle 0xp = \angle qyx$, $\sin\theta \ge 1/3$, and so $|x - q| \ge 2/3$. It follows that
$V \cap B(y, 2) \supset V \cap B(q, 2/3)$.


Now polyhedra to approximate convex sets will be constructed. Let $W_d$ be
the cube centered at 0 in $\mathbb{R}^d$ of side $2/d^{1/2}$, parallel to the axes, so that the
coordinates of the vertices are $\pm 1/d^{1/2}$. Recall that $f \sim g$ means $f/g \to 1$.
Given $\varepsilon > 0$, decompose the $2d$ faces of $W_d$ into equal $(d - 1)$-cubes of side
$s_d := s_d(\varepsilon)$ where $s_d \sim c(\varepsilon/d)^{1/2}$ as $\varepsilon \downarrow 0$ and $c := 10^{-4}/(d^{1/2}(d - 1))$, so
$s_d \sim \varepsilon^{1/2}10^{-4}/(d(d - 1))$. Specifically, let $s_d := 2/(d^{1/2}k_d)$ where $k_d := k_d(\varepsilon)$
is the smallest positive integer such that $s_d \le \varepsilon^{1/2}10^{-4}/(d(d - 1))$. Then for
$0 < \varepsilon \le 1$,

$$\varepsilon^{1/2}10^{-4}/d^2 \le s_d \le \varepsilon^{1/2}10^{-4}/(d(d - 1)).$$

Let $\mathcal{L}_d$ be the set of all $(d - 1)$-cubes thus formed.
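The choice of $k_d$ and the resulting two-sided bound on $s_d$ can be tabulated. The sketch below assumes the reading given above, $s_d = 2/(d^{1/2}k_d)$ with $k_d$ the least positive integer making $s_d \le \varepsilon^{1/2}10^{-4}/(d(d-1))$, for $d \ge 2$; the name `grid_side` is hypothetical.

```python
import math

def grid_side(d, eps):
    """Return (k_d, s_d): k_d is the least positive integer with
    s_d = 2 / (sqrt(d) * k_d) <= sqrt(eps) * 1e-4 / (d * (d - 1)), for d >= 2."""
    target = math.sqrt(eps) * 1e-4 / (d * (d - 1))
    k = max(1, math.ceil(2.0 / (math.sqrt(d) * target)))
    # Adjust by at most a step or two in case floating-point rounding misled ceil().
    while 2.0 / (math.sqrt(d) * k) > target:
        k += 1
    while k > 1 and 2.0 / (math.sqrt(d) * (k - 1)) <= target:
        k -= 1
    return k, 2.0 / (math.sqrt(d) * k)

for d, eps in ((2, 1.0), (3, 0.1), (5, 0.01)):
    k, s = grid_side(d, eps)
    lower = math.sqrt(eps) * 1e-4 / d ** 2
    upper = math.sqrt(eps) * 1e-4 / (d * (d - 1))
    print(d, eps, k, lower <= s <= upper)
```

Each face of $W_d$ is then cut into $k_d^{d-1}$ subcubes, so the number of grid vertices per face is $(k_d + 1)^{d-1}$.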

The diameter of each cube in $\mathcal{L}_d$ is at most $d^{1/2}s_d \le \varepsilon^{1/2}10^{-4}/(d^{1/2}(d - 1))$. The
next fact follows directly, by the law of sines, since $0 < \varepsilon \le 1$ and $d \ge 2$, and
$\sin^{-1}x \le 1.1x$ for $0 \le x \le 10^{-4}$.


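Lemma 8.32 (a) For any cube in $\mathcal{L}_d$ and any two vertices $p$ and $q$ of the cube,
$$\angle p0q \le \sin^{-1}\bigl(10^{-4}\varepsilon^{1/2}/(d - 1)\bigr) \le (1.1)10^{-4}\varepsilon^{1/2}/(d - 1).$$
(b) The total number of vertices of all the cubes in $\mathcal{L}_d$ is less than
$$2d(k_d(\varepsilon) + 1)^{d-1} \le K_d\varepsilon^{-(d-1)/2}$$
where $K_d := 2d\bigl((2 \cdot 10^4 + 1)d^{3/2}\bigr)^{d-1}$.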


Next, there is a triangulation of each cube in $\mathcal{L}_d$, in other words, a decomposition 
of the cube into $(d-1)$-simplices, with disjoint interiors, where each 
simplex is a convex hull of some $d$ of the $2^{d-1}$ vertices of the cube. That 
such a triangulation (without additional vertices) exists (is well known to 
algebraic topologists and) can be seen as follows. By induction, it will be 
enough to treat $S \times [0,1]$ where $S$ is a simplex with vertices $v_0, \dots, v_p$. 
Let $a_i := (v_i, 0)$, $b_i := (v_i, 1)$. Then for each $i = 0, 1, \dots, p$, the points 
$a_0, \dots, a_i, b_i, \dots, b_p$ are vertices of a $(p+1)$-dimensional simplex $S_i$. To 
see that these $S_i$ give the desired decomposition of $S \times [0,1]$, note first that 
each point of a simplex is a unique convex combination of the vertices. 
For each point $z$ of $S \times [0,1]$, $z = (\sum_i \lambda_i v_i, x)$ for some unique $x \in [0,1]$ 
and $\lambda_i \ge 0$ with $\sum_i \lambda_i = 1$. Then $z = \sum_{i \le j} \mu_i a_i + \sum_{i \ge j} \rho_i b_i$, where $\mu_i \ge 0$, 
$\rho_i \ge 0$, and $\sum_{i \le j} \mu_i + \sum_{k \ge j} \rho_k = 1$, if and only if $\mu_i = \lambda_i$ for $i < j$, $\rho_k = \lambda_k$ 
for $k > j$, $\lambda_j = \mu_j + \rho_j$, and $x = \sum_{k \ge j} \rho_k$. Thus $z$ is in $S_j$ if and only if 
$\sum_{i > j} \lambda_i \le x \le \sum_{i \ge j} \lambda_i$. If both inequalities are strict, then $j$ is unique. Every 
point of $S \times [0,1]$ is in some $S_j$, and a point in more than one $S_j$ is on the 
boundary of both. 
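The criterion for membership in $S_j$ can be turned into a short computation. The following is an illustrative sketch only; the function name `locate_in_prism` and the use of exact rational arithmetic are choices made here, not taken from the text. Given the barycentric coordinates $\lambda_i$ and the height $x$, it returns an index $j$ and the coefficients $\mu_i$, $\rho_i$ described above.

```python
from fractions import Fraction

def locate_in_prism(lam, x):
    """Given barycentric coordinates lam[0..p] (nonnegative, summing to 1)
    and a height x in [0, 1], return (j, mu, rho) such that
    z = sum_{i<=j} mu[i] a_i + sum_{i>=j} rho[i] b_i lies in the simplex S_j."""
    p = len(lam) - 1
    # tail[j] = sum of lam[k] over k >= j, with tail[p + 1] = 0.
    tail = [Fraction(0)] * (p + 2)
    for k in range(p, -1, -1):
        tail[k] = tail[k + 1] + lam[k]
    for j in range(p + 1):
        if tail[j + 1] <= x <= tail[j]:
            rho_j = x - tail[j + 1]
            mu = {i: lam[i] for i in range(j)}
            mu[j] = lam[j] - rho_j
            rho = {k: lam[k] for k in range(j + 1, p + 1)}
            rho[j] = rho_j
            return j, mu, rho
    raise ValueError("need 0 <= x <= 1 and coordinates summing to 1")

lam = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
j, mu, rho = locate_in_prism(lam, Fraction(1, 4))
print(j, mu, rho)
```

When $x$ equals one of the partial sums, the point lies on a common face of two simplices and the function returns the smaller index.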

Let $K$ be a convex set including a neighborhood of 0. For each vertex $p_i$ of 
a cube in $\mathcal{L}_d$, let $H_i$ be the half-line starting at 0 passing through $p_i$ and let $v_i$ 
be the unique point at which $H_i$ passes through the boundary of $K$. For each 
simplex $S_j$ in the triangulation of the cubes in $\mathcal{L}_d$, let $T_j$ be the corresponding 
simplex with vertices $v_i$ in place of $p_i$. Let $\pi_\varepsilon(K)$ be the polyhedron with faces 
$T_j$, in other words the union of the $d$-dimensional simplices which are convex 
hulls of $T_j \cup \{0\}$. For $d \ge 3$, $\pi_\varepsilon(K)$ is not necessarily convex. 


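Lemma 8.33 Let $E \in \mathcal{C}_{2,1,3}$, $\delta > 0$ and $\varepsilon > 0$. For $i = 1, 2$ let $z_i$ be points such 
that $z_i \in E^{16\varepsilon} \setminus E$. Assume that $\angle z_1 0 z_2 \le \delta \le \frac12$. Then $|z_1 - z_2| \le 5\delta + 96\varepsilon$. 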


Proof. Let $x_i := x(z_i)$ be the point where the line segment from 0 to $z_i$ intersects 
$\partial E$, $i = 1, 2$. By Lemma 8.30, $|z_i - x_i| \le 48\varepsilon$, $i = 1, 2$. We have $|z_i| \ge 1$ and 
$|x_i| \ge 1$ for all $i$. 

By symmetry we can assume $|x_1| \ge |x_2|$. To get a bound for $|x_1 - x_2|$ we 
can assume $x_1$ and $x_2$ are not on the same line through 0, or they would be equal. 
Take a half-line $L$ starting at $x_1$ which is tangent to the unit circle $\partial B(0, 1)$ at 
a point $v$ and crosses the half-line from 0 through $z_2$ at a point $y$. 

If $v$ is between $x_1$ and $y$, then $|x_1 - v| \le \tan\delta \le 2\delta$ since $\delta \le \frac12 < \pi/4$. 
So $1 \le |x_2| \le |x_1| \le (1 + 4\delta^2)^{1/2} \le 1 + 2\delta^2 \le 1 + \delta$, and likewise $1 \le |y| \le 
1 + \delta$, so 

$$|x_2 - v| \le |x_2 - y| + |y - v| \le \delta + 2\delta = 3\delta$$

and $|x_1 - x_2| \le 5\delta$. 

Otherwise, $y$ is between $x_1$ and $v$. Then $|x_1 - x_2|$ is maximized when $|x_2| = 
|x_1|$ or $x_2 = y$, since $x_2$ must not be in the convex hull of $\{x_1\} \cup B(0, 1)$. If 
$|x_2| = |x_1|$, then 

$$|x_1 - x_2| \le 2\bigl(2\sin\tfrac{\delta}{2}\bigr) \le 2\delta.$$

To bound $|x_1 - y|$ let $\varphi := \angle 0x_1v$. Then $\sin\varphi \ge 1/2$, and 

$$|x_1 - y| \le \tan\bigl(\tfrac{\pi}{2} - \varphi\bigr) - \tan\bigl(\tfrac{\pi}{2} - \varphi - \delta\bigr) \le \delta\sec^2\bigl(\tfrac{\pi}{2} - \varphi\bigr) \le 4\delta.$$

So $|x_1 - x_2| \le 5\delta$ in all cases, and $|z_1 - z_2| \le 5\delta + 96\varepsilon$. 


Lemma 8.34 Let $E \in \mathcal{C}_{2,1,3}$. Suppose $2 \le K < \infty$ and $0 < \varepsilon \le 10^{-12}/(K-1)^2$. 
Let $c := (1.1)10^{-4}/(K^{1/2}(K-1))$. Let $r \ge 2/3$ and let $a$, $b$, $\alpha$ be points 
of $\mathbb{R}^2$ such that $a \notin E$, $B(\alpha, r) \subset E$, $b$ is on the boundary of $E$ and of 
$B(\alpha, r)$, $d(a, E) \le 16\varepsilon$, and $\beta := \angle a0b \le c(K\varepsilon)^{1/2}$. Then $d(a, B(\alpha, r)) \le 
50\varepsilon$. Also, $|a - b| \le .0008\,\varepsilon^{1/2}/(K-1)$ and $\angle a\alpha b \le D(K)\varepsilon^{1/2}$ where 
$D(K) := .002/(K-1)$. 

Proof. Apply Lemma 8.29 at $x = b$. The tangent line $L$ to $E$ at $b$ is also tangent 
to the circle $\partial B(\alpha, r)$. 
Let $\gamma := \angle 0bp$. Then $\gamma \ge \sin^{-1}(1/3) > 1/3$. We have 

$$\beta \le c(K\varepsilon)^{1/2} = (1.1)10^{-4}\varepsilon^{1/2}/(K-1) \le (1.1)10^{-10}. \tag{8.15}$$

Let $H$ be the half-plane including $E$ bounded by $L$. 
First suppose $a \notin H$. Then $d(a, H) \le 16\varepsilon$. Let the line from 0 to $a$ intersect 
$L$ at a point $\eta$. If $\eta$ is between $p$ and $b$, then $\gamma + \beta \le \pi/2$ and 

$$|\eta - b| = |p|\bigl(\cot\gamma - \cot(\gamma + \beta)\bigr) \le 3\beta\csc^2\gamma \le 27\beta.$$

If $p$ is between $\eta$ and $b$, then 

$$|\eta - b| = |p|\bigl(\cot\gamma + \tan(\gamma + \beta - \tfrac{\pi}{2})\bigr) = |p|\bigl(\cot\gamma - \cot(\gamma + \beta)\bigr),$$

where now $\pi/2 < \gamma + \beta < \pi$. Thus 

$$|\eta - b| \le |p|\,\beta\max\bigl(\csc^2\gamma, \csc^2(\gamma + \beta)\bigr) \le 3\beta\max\bigl(\csc^2\gamma, \csc^2(\tfrac{\pi}{2} + 10^{-9})\bigr) \le 27\beta \le 27c(K\varepsilon)^{1/2}.$$

The other possibility is that $b$ is between $\eta$ and $p$. Then 
$|\eta - b| = |p|\bigl(\cot(\gamma - \beta) - \cot\gamma\bigr)$. 
Now $\beta \le (1.1)10^{-10}$ and $\gamma \ge \sin^{-1}(1/3)$ imply $\sin(\gamma - \beta) > 0.333$, so 
$|\eta - b| \le 3\beta/(.333)^2 \le 28\beta \le 28c(K\varepsilon)^{1/2}$. 
For any ordering of $b$, $p$ and $\eta$, and under the same condition on $\varepsilon$, 
$|a - \eta| \le 16\varepsilon/\sin(\gamma - \beta) \le 49\varepsilon$. 

Next, let $x$ be the distance from a varying point $\zeta$ on $L$ to $b$. Then the 
distance $y$ from $\zeta$ to the circle $\partial B(\alpha, r)$ satisfies $y = (r^2 + x^2)^{1/2} - r$. Now 
$(r^2 + t)^{1/2} \le r + t$ for $t \ge 0$ and $r \ge 2/3$, so $0 \le y \le x^2$ for all $x$. So, the 
distance from $a$ to $B(\alpha, r)$ is at most $C\varepsilon$ for $C = 49 + 28^2c^2K < 50$, giving 
the first conclusion for $a \notin H$. 

A line $W$ through $\alpha$, orthogonal to the line $V$ through 0 and $b$, meets $V$ 
at a point $\xi$. Then $\angle b\alpha\xi = \gamma \ge \sin^{-1}(1/3)$, so $|\alpha - \xi| \le r\bigl(1 - \tfrac19\bigr)^{1/2}$. Let $q$ 
be the point on the circle $|q - \alpha| = r$ and the line $W$, on the same side of $\alpha$ 
as $\xi$ is. Then $|q - \xi| \ge \tfrac23\bigl(1 - (\tfrac89)^{1/2}\bigr) > .03$. By (8.15), $\beta < \tan^{-1}(.01)$, so 
the line through 0, $\eta$ and $a$ must intersect $B(\alpha, r)$. Then since $E$ is convex, 
$a \notin E$, $0 \in E$, and $B(\alpha, r) \subset E$, for $a \in H$, $d(a, B(\alpha, r))$ is maximized when 
$a = \eta$ on $L$, and $d(a, B(\alpha, r)) \le C\varepsilon$ for the same $C < 50$ as before. So the first 
conclusion is proved. 

Lemma 8.33 with $\delta = \beta$ gives 

$$|a - b| \le 5\beta + 96\varepsilon \le .0008\,\varepsilon^{1/2}/(K-1),$$

so $\angle a\alpha b \le \sin^{-1}\bigl(|a - b|/(2/3)\bigr) \le .002\,\varepsilon^{1/2}/(K-1)$. So Lemma 8.34 is 
proved. 


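Lemma 8.35 Let $\alpha \in \mathbb{R}^2$, $B := B(\alpha, r) \subset \mathbb{R}^2$, where $r \ge 2/3$ and $0 \in B$, so 
$|\alpha| \le r$. Let $2 \le K < \infty$ and $0 < \varepsilon \le 10^{-12}/(K-1)^2$. Let $a_i$, $i = 1, 2$, be 
points not in $B$ with $d(a_i, B) \le 50\varepsilon$, $|a_1 - a_2| \le D(K)\varepsilon^{1/2}$, and $\angle a_1\alpha a_2 \le 
2D(K)\varepsilon^{1/2}$, where again $D(K) = .002/(K-1)$. Let $S \supset B$ be a bounded 
convex set with $a_i \in \partial S$, $i = 1, 2$. Let $S_1$ be the triangle with vertices 0, $a_1$, 
$a_2$. Let $W$ be the convex wedge with vertex 0 bounded by the half-lines from 0 
through $a_1$ and $a_2$. Then 

$$h(S \cap W, S_1) \le \frac{\varepsilon}{9(K-1)}.$$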

Proof. Since $S_1 \subset S \cap W$, it will be enough to show that for $z_0 \in \partial S \cap W$, 
$d(z_0, [a_1, a_2]) \le \varepsilon/(9(K-1))$ where $[a_1, a_2]$ is the line segment joining $a_1$ to 
$a_2$. 

It is easy to see that $a_1$, $a_2$, and $\alpha$ are not all on a line, unless $a_1 = a_2$, when 
the result clearly holds, so assume they are not on a line. Let $L_i$ be a tangent 
line to $B$, at a point $b_i$, through $a_i$, where of the two such tangent lines, $L_i$ is 
the one for which $\angle b_i\alpha a_i < \angle b_i\alpha a_{3-i}$, $i = 1, 2$, and $b_i$ is not in $W$. 

Now $|a_i - \alpha| \le r + 50\varepsilon$, so 

$$\angle b_i\alpha a_i \le \cos^{-1}\bigl(r/(r + 50\varepsilon)\bigr) \le \cos^{-1}(1 - 75\varepsilon).$$

By derivatives, $\cos x \le 1 - \tfrac12 x^2 + \tfrac1{24}x^4$ for all $x$, so $\cos x \le 1 - 11x^2/24$, 
$0 \le x \le 1$. Thus $\cos^{-1}(1 - 11x^2/24) \le x$, and $\cos^{-1}(1 - 75\varepsilon) \le (164\varepsilon)^{1/2}$. 
Now $2D(K)\varepsilon^{1/2} < 10^{-8}$, and $(164\varepsilon)^{1/2} < 10^{-4}$, so the angles $\angle b_i\alpha a_i$, $i = 1, 2$, 
and $\angle a_1\alpha a_2$ add up to 

$$\angle b_1\alpha b_2 \le \bigl(2(164)^{1/2} + 2D(K)\bigr)\varepsilon^{1/2} < .001 < \pi. \tag{8.16}$$

So the lines $L_i$ intersect, in a unique point $m$. 

Now $z_0$ is in the triangle $a_1a_2m$ because: $z_0 \in W$ and $a_1, a_2 \in \partial S$, $b_1, b_2 \in S$, 
so $z_0$ cannot be in the triangle $\alpha a_1a_2$ unless it is on $[a_1, a_2]$. Also if $z_0$ were on 
the side of $L_i$ away from $\alpha$, then $a_i \notin \partial S$, a contradiction, $i = 1, 2$. 

Next, $\angle ma_1a_2 + \angle ma_2a_1 = \pi - \angle b_1mb_2 = \bigl(\tfrac{\pi}{2} - \angle\alpha mb_1\bigr) + 
\bigl(\tfrac{\pi}{2} - \angle\alpha mb_2\bigr) = \angle b_1\alpha b_2$. 
We have $d(z_0, [a_1, a_2]) \le d(m, [a_1, a_2])$. Then $d(m, [a_1, a_2]) \le |a_1 - a_2| 
\tan\angle ma_1a_2$, and by (8.16) 

$$\tan\angle ma_1a_2 \le 2\angle ma_1a_2 \le \bigl(52 + 4D(K)\bigr)\varepsilon^{1/2},$$

and $|a_1 - a_2| \le D(K)\varepsilon^{1/2}$ by assumption. So 

$$d(m, [a_1, a_2]) \le \bigl(52D(K) + 4D(K)^2\bigr)\varepsilon \le \varepsilon/(9(K-1)).$$


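Lemma 8.36 Let $d \ge 2$, $C \in \mathcal{C}_d$, $0 < \varepsilon \le 10^{-12}/(d-1)^2$, and $N' \in 
\mathcal{N}_{4\varepsilon}(C^2)$. Let $N$ be the polyhedron $\pi_\varepsilon(N')$. Then $h(N, N') \le \varepsilon/9$. 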


Proof. For each $j$, the simplex $S_j$ and the origin span a convex cone $W_j$, the 
union of all half-lines with endpoint 0 passing through $S_j$. It suffices to show 
that for each $j$, 

$$h(N' \cap W_j, N \cap W_j) \le \varepsilon/9.$$

Fix such a cone $W = W_j$ and simplex $S = S_j$. For $s = 1, \dots, d$, let $W^{(s)}$ be the 
union of the $s$-dimensional faces of $W$, so that $W^{(d)} = W$, $W^{(1)}$ is the union of 
half-lines through 0 and vertices of $S$, and so on. It will be shown by induction 
on $s$ that if $z \in W^{(s)} \cap N'$, then 

$$d(z, N) \le \varepsilon(s-1)/(9(d-1)). \tag{8.17}$$

For $s = d$ this will give the desired conclusion. 

For $s = 1$, (8.17) holds by definition of $N$. Suppose it holds for a given $s$. 
Take any $z \in N' \cap W^{(s+1)}$. Let $x = x(z)$ be the point at which the half-line from 
0 through $z$ intersects $\partial C^2$. There is a point $\beta \in \partial C$ such that $|x - \beta| = 2$: 
to see this let $H$ be a support hyperplane to $C^2$ at $x$. Let $H_1$ be a hyperplane 
parallel to $H$ at distance 2 in the direction toward $C$. Then it is easily seen 
that $H_1$ is a support hyperplane to $C$ at a point $\beta$, the nearest point in $H_1$ to $x$. 
Clearly $B(\beta, 2) \subset C^2$. 

Let $F_{s+1}$ be an $(s+1)$-dimensional face of $W$ containing $z$. Let $\pi_2$ be a 
two-dimensional subspace with $z \in \pi_2 \cap W \subset F_{s+1}$. Then the two rays at the 
edges of the wedge $\pi_2 \cap W$ are included in $W^{(s)}$. Let these rays intersect $\partial N'$ 
at points $u$, $v$. 

By Lemma 8.31, the disk $\pi_2 \cap B(\beta, 2)$ contains 0 and has radius at least 2/3. 
So it is a disk $\pi_2 \cap B(\alpha, r)$, $\alpha \in \pi_2$, with $|x - \alpha| = r \ge 2/3$. Then $u, v \in \partial N'$ 
implies $u$, $v$ are not in $C^2$ and so not in $B(\alpha, r)$. Now apply Lemma 8.34 to 
$E = C^2 \cap \pi_2$, with $K = d$, first for $(a, b) = (u, x)$, then for $(a, b) = (v, x)$. To 
justify the application we need the following: 

Claim $d(u, E) \le 16\varepsilon$ and $d(v, E) \le 16\varepsilon$. 

To prove the claim, there is a point $r \in C^2$ with $|u - r| \le 4\varepsilon$. Let $U$ be the 
two-dimensional subspace spanned by $u$ and $r$ (if $u$ and $r$ are on a line through 
0 then $r \in \pi_2$ and $d(u, E) \le 4\varepsilon$). Let $M$ be the line in $U$ through $r$ which 
crosses the half-line $D$ from 0 through $u$ at a point $t$ and is tangent to the unit 
disk $B(0, 1)$ in $U$ at a point $w$. Let $z$ be the closest point to $r$ on $D$. Then 
$|r - z| \le 4\varepsilon$. Let $\varphi = \angle 0tw = \angle ztr$. Since $t \in C^2$, $|t| \le 3$ so $\sin\varphi \ge 1/3$ and 
$|r - t| \le 12\varepsilon$ so $|t - u| \le |t - r| + |r - u| \le 16\varepsilon$, proving the Claim for $u$ 
and by symmetry for $v$. 


It will be shown that: 
(a) if $u$ and $v$ are points of a simplex with vertices $w_i$, then 

$$\angle u0v \le \max_{i,j} \angle w_i0w_j.$$

Here (a) will follow from 

(b) if $u$ is fixed and $v$ is in a simplex with vertices $w_i$, then $\angle u0v \le 
\max_i \angle u0w_i$, since (b) could be applied in stages to prove (a). Further, it 
will be enough to prove (b) for a simplex reducing to a line segment from $w_1$ 
to $w_2$, since then, the case of general $v$ would reduce in stages to cases where 
$v$ is on a boundary face of the simplex, then a lower-dimensional face, and so 
on until $v$ is a vertex. So it will be enough to show that 

(c) For all $u$, $x$, $y$ in $\mathbb{R}^d$ and $0 \le \lambda \le 1$, 
$\angle u0(\lambda x + (1 - \lambda)y) \le \max(\angle u0x, \angle u0y)$. 

Expressing (c) in terms of cosines, and then of scalar products and lengths, we 
can reduce to the case where $|x| = |y| = |u| = 1$, and then check the condition. 
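A quick numerical sanity check of (c) is sketched below; it is an illustration, not a proof. As an assumption of this sketch, only configurations in which both $\angle u0x$ and $\angle u0y$ are at most $\pi/2$ are sampled, which covers the very small angles arising in this proof. The function names are hypothetical.

```python
import math
import random

def angle(u, v):
    """Angle at the origin between nonzero vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def check_angle_inequality(trials=2000, dim=3, seed=0, tol=1e-9):
    """Sample u, x, y with angle(u, x) and angle(u, y) at most pi/2 and
    test angle(u, lam*x + (1-lam)*y) <= max(angle(u, x), angle(u, y))."""
    rng = random.Random(seed)
    for _ in range(trials):
        u = [rng.gauss(0, 1) for _ in range(dim)]
        pts = []
        for _ in range(2):
            w = [rng.gauss(0, 1) for _ in range(dim)]
            if sum(a * b for a, b in zip(u, w)) < 0:
                w = [-a for a in w]  # reflect so the angle with u is at most pi/2
            pts.append(w)
        x, y = pts
        lam = rng.random()
        z = [lam * a + (1 - lam) * b for a, b in zip(x, y)]
        if math.sqrt(sum(c * c for c in z)) < 1e-12:
            continue
        if angle(u, z) > max(angle(u, x), angle(u, y)) + tol:
            return False
    return True

print(check_angle_inequality())
```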

Now it is easily seen using also Lemma 8.32 that all the hypotheses of Lemma 
8.34 hold, hence so do its conclusions. Next, Lemma 8.35 will be applied with 
again $K = d$, and with $a_1 = u$, $a_2 = v$, and $S := N' \cap \pi_2$. It will be checked 
that the hypotheses of Lemma 8.35 hold. We have $\angle a_1\alpha a_2 \le 0.004\,\varepsilon^{1/2}/(d-1)$. 
The angles $\angle u\alpha x$ and $\angle x\alpha v$ both have the upper bound $0.002\,\varepsilon^{1/2}/(d-1)$. 
The other hypotheses of Lemma 8.35 follow from Lemma 8.34, so Lemma 
8.35 does apply. Its conclusion gives $d(z, [u, v]) \le \varepsilon/(9(d-1))$. By induction 
hypothesis, $d(y, N) \le \varepsilon(s-1)/(9(d-1))$ for $y = u, v$, and then by convexity 
for any $y \in [u, v]$. It follows that $d(z, N) \le \varepsilon s/(9(d-1))$, completing the 
induction and the proof of Lemma 8.36. 


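Lemma 8.37 Let $C \in \mathcal{C}_d$. Then for any $\varepsilon > 0$ with $\varepsilon \le 10^{-12}/(d-1)^2$, there 
is an $\varepsilon/2$-net for $\mathcal{N}_{4\varepsilon}(C)$ containing at most $g_d(\varepsilon)$ points where $g_d(\varepsilon) := 
\exp\bigl(a(d)\varepsilon^{(1-d)/2}\bigr)$ and $a(d) = K_d\log 24$ with $K_d$ as in Lemma 8.32. 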


Proof. Recall that $C \mapsto C^2$ gives an isometry for $h$ of $\mathcal{N}_{4\varepsilon}(C)$ into $\mathcal{N}_{4\varepsilon}(C^2)$. Let 
$N' \in \mathcal{N}_{4\varepsilon}(C^2)$ and let the vertices of the polyhedron $N := \pi_\varepsilon(N')$ be $y_i$, $i = 
1, 2, \dots$. By Lemma 8.36, $h(N, N') \le \varepsilon/9$. 
For each $i$, on the half-line from 0 through $y_i$, take the interval $J_i := 
[y_i, y_i + 12\varepsilon y_i/|y_i|]$ of length $12\varepsilon$ starting at $y_i$. Then $J_i$ has an $\varepsilon/4$-net $J_i'$ 
containing 24 points (midpoints of subintervals of length $\varepsilon/2$). 

By Lemma 8.30, for every $M \in \mathcal{N}_{4\varepsilon}(C^2)$, $\pi_\varepsilon(M)$ has a vertex $v_i$ in $J_i$, which 
is within $\varepsilon/4$ of some $u_i \in J_i'$. The $(d-1)$-simplex with vertices $v_i$ is within 
$\varepsilon/4$ for $h$ of the one with vertices $u_i$. The same is true of the $d$-simplices where 
the vertex 0 is adjoined to both. Thus if $\mathcal{L}_\varepsilon$ is the set of all polyhedra with one 
vertex in $J_i'$ for each $i$, defined as in the definition of $\pi_\varepsilon$, then $\mathcal{L}_\varepsilon$ is $\varepsilon/2$-dense 
in $\mathcal{N}_{4\varepsilon}(C^2)$. The number of values of $i$ is at most $K_d\varepsilon^{-(d-1)/2}$ by Lemma 8.32, 
and Lemma 8.37 follows. 


Now it will be shown that there are constants $K_d, L_d < \infty$ such that, for 
$0 < \varepsilon \le 1$, there is an $\varepsilon$-net for $\mathcal{C}_d$ containing at most $f_d(\varepsilon)$ sets where 

$$f_d(\varepsilon) = L_d\exp\bigl(K_d\varepsilon^{(1-d)/2}\bigr). \tag{8.18}$$

This will clearly imply the upper bound for $h$ in Theorem 8.25. 

It will be enough (possibly changing $K_d$ and $L_d$) to prove (8.18) for $\varepsilon$ small 
enough, specifically for $0 < \varepsilon \le \varepsilon_0 := 10^{-12}/(d-1)^2$. For such $\varepsilon$, and $a(d)$ 
from Lemma 8.37, let $K_d := a(d)/(1 - 2^{(1-d)/2})$. ($L_d$ will be specified below.) 
We have the decomposition 

$$(0, \varepsilon_0] = \bigcup_{k \ge 1} I_k, \qquad I_k := (\varepsilon_0/2^k, \varepsilon_0/2^{k-1}].$$

Before getting more precise bounds, it will help to see that $N(\varepsilon, \mathcal{C}_d) < \infty$ 
for all $\varepsilon > 0$. The cube $[-1, 1]^d$ can be written as a finite union $\bigcup_i C_i$ of cubes 
of side $\le \varepsilon/d^{1/2}$ and so of diameter $\le \varepsilon$. For each nonempty $C \in \mathcal{C}_d$, let $B(C)$ 
be the union of all $C_i$ which intersect $C$. Then $C \subset B(C)$ and $h(B(C), C) \le \varepsilon$. 
It follows that $N(\varepsilon, \mathcal{C}_d) < \infty$. 

So taking $\varepsilon = \varepsilon_0/2$, $L_d$ can be and hereby is chosen so that (8.18) holds 
for $\varepsilon \in I_1$. Then (8.18) will be proved for $\varepsilon \in I_k$ by induction on $k$. Suppose it 
holds for all $\varepsilon \in I_k$. Let $\delta \in I_{k+1}$. Then $\varepsilon := 2\delta \in I_k$. By induction, there is an 
$\varepsilon$-net $\{C_i\}$ for $\mathcal{C}_d$ with at most $f_d(\varepsilon)$ values of $i$. By Lemma 8.37, there exist 
at most $g_d(\varepsilon/2)$ sets $K_j$, not necessarily convex, such that each $A \in \mathcal{N}_\varepsilon(C_i)$ 
is within $\varepsilon/4$ of some $K_j$. Thus there is a $\delta$-net for $\mathcal{N}_\varepsilon(C_i)$ containing at 
most $g_d(\delta)$ sets, and a $\delta$-net for $\mathcal{C}_d$ containing at most $g_d(\delta)f_d(2\delta)$ sets. Now 
$a(d) + 2^{(1-d)/2}K_d \le K_d$ by choice of $K_d$, so $g_d(\delta)f_d(2\delta) \le f_d(\delta)$, and the 
upper bound in Theorem 8.25 is proved. 

Now to prove the lower bound, let $\rho$ be the Euclidean metric on $\mathbb{R}^d$ and 
recall that $S^{d-1}$ denotes the unit sphere $\{x \in \mathbb{R}^d : |x| = 1\}$. Let $d \ge 2$. If $v_d$ is 
the volume of $B(0, 1) \subset \mathbb{R}^d$, then 


a ((S4')*) = val + 8) — 1] > due. 
Then by the left side of the last displayed inequality in the proof of Lemma 
8.27, there is an $a_d > 0$ such that 

$$D(\varepsilon, S^{d-1}, \rho) \ge a_d\varepsilon^{1-d} \quad\text{for } 0 < \varepsilon \le 1.$$

Given $\varepsilon$, take a set $\{x_i\}_{i=1}^m$ of points of $S^{d-1}$ more than $2\varepsilon$ apart, of maximal 
cardinality $m := D(2\varepsilon) := D(2\varepsilon, S^{d-1}, \rho)$. As above let $\angle abc$ denote the 
angle at $b$ in the triangle $abc$. Then for $i \ne j$, 

$$\theta := \angle x_i0x_j \ge 2\sin\tfrac{\theta}{2} > 2\varepsilon.$$

Let $K_i$ be the half-line from 0 through $x_i$. Let $C_i$ be the spherical cap cut from 
the unit ball $B_d$ by a hyperplane orthogonal to $K_i$ at a distance $\cos\varepsilon$ from 0. 
Then the caps $C_i$ are disjoint. 

For any set $I \subset \{1, \dots, m\}$ let $D_I := B_d \setminus \bigcup_{i \in I} C_i$. Then each $D_I$ is convex. 
Let $\lambda^d$ be $d$-dimensional Lebesgue measure (volume). Then for all $i$, $\lambda^d(C_i) \ge 
b_d\varepsilon^{d+1}$ for some constant $b_d > 0$. By Lemma 8.8 there are at least $e^{m/6}$ sets 
$I(j)$ such that for all $j \ne k$ the symmetric difference $I(j)\,\Delta\,I(k)$ contains at 
least $m/5$ elements. Then 

$$\lambda^d\bigl(D_{I(j)}\,\Delta\,D_{I(k)}\bigr) \ge c_d\varepsilon^{1-d}\varepsilon^{d+1} = c_d\varepsilon^2$$

for a constant $c_d := a_db_d2^{1-d}/5$. So for some $\beta_d > 0$, 

$$D(\delta, \mathcal{C}_d, d_\lambda) \ge \exp\bigl(\beta_d\delta^{(1-d)/2}\bigr) \quad\text{for } 0 < \delta \le 1.$$

This finishes the proof of the lower bound for $d_\lambda$, and so also for $h$, so Theorem 
8.25 is proved. 


Problems 


1. Let $(K, d)$ be a compact metric space. Show that the collection of all 
nonempty closed subsets of $K$, with the Hausdorff metric, is also compact. 
Hints: Find an $\varepsilon$-net in the collection of compact nonempty subsets of $K$ by 
taking an $\varepsilon$-net $F$ in $K$, then taking all nonempty subsets of $F$. If $H_n$ is a 
Cauchy sequence of closed sets for the Hausdorff metric, let $H$ be the set of $x$ 
such that for some $x_n$ in $H_n$ there is a subsequence $x_{n(k)}$ converging to $x$. Show 
that $H$ is a nonempty compact set and that $H_n$ converges to it for the Hausdorff 
metric. 


2. In the proof of Theorem 8.4, just after Lemma 8.6, show that the cubes can 
be ordered so that i = j — 1. 


3. If Proposition 1.16 is used instead of Proposition 1.12 in the proof of Lemma 
8.8, what is the result? 


4. Show that in Theorem 8.13(a), K(d, æ, K) cannot be replaced by I(d, a, K), 
or the collection of nonempty subsets of I(d, «œ, K), if d=2 anda > 1. 
Hint: Let $f((\cos\theta, \sin\theta)) := (1 - \cos\theta)/2$, so $f$ takes $S^1$ onto $[0, 1]$. Then 
I(Cf, 0)) = Ø where (f, 0)(u, v) := (f(u, v), 0). For any interval (a, b) C R 
there is a C% function gía b) > 0 with g > 0 just on (a, b). Show that for 
functions w = (f, T ôi gaib) (f)) for disjoint (a, i, bj) and small enough 
6; > 0, depending on k, ||W||, can remain bounded as k increases while Z(Y) 
can approximate any finite subset of [0, 1] x {0} for h. 


5. Show that the unit disk {(x, y) : x?” + y? < 1} belongs to a class Co of unions 
in Theorem 8.12 for k = 4, any a € (0, œo) and some K = K(a) < œœ. Hint: 
The function g(x) := (1 — x)!/? is smooth on intervals |x| < ¢ for ¢ < 1. 


6. Find a constant c > 0 such that lim inf, jo ¢ log N; (e, LL21, 23) > c,and 
likewise for D(e, LL2,1, d,). Hint: Consider squares along a decreasing diag- 
onal, S$; := Sjm4i-j, J = 1,..., m — 1, for the grid defined in the proof of 
Theorem 8.21. The union of U;<m+1-;j Sji and an arbitrary set of S; gives a set 


in LL21. Apply Lemma 8.8. 


7. (a) The proof of Theorem 8.21 used the inequality $\binom{2k}{k} \le 4^k$. Show that, 
conversely, for all $k \ge 1$, $\binom{2k}{k} \ge 4^kk^{-1/2}/3$. Hint: Use Stirling's formula (Theorem 
1.17). 

(b) Use this to give a lower bound for $\liminf_{\varepsilon \downarrow 0} \varepsilon\log D(\varepsilon, LL_{2,1}, h)$. 


8. In the proof of Lemma 8.22, after Lemma 8.23, a function f is defined with 
I fll < (d — 1)!⁄. For d = 2, deduce this from monotonicity of x(-), y(-) in 
the proof of Theorem 8.21. 


9. Show that the lower layers in R? form a Glivenko—Cantelli class for any 
law having a bounded support and bounded density with respect to Lebesgue 
measure. 


Notes 


Notes to Section 8.1. Hausdorff (1914, Section 28) defined his metric between 
closed, bounded subsets of a metric space. 


Notes to Section 8.2. Rademacher (1919) proved his theorem on almost every- 
where Fréchet differentiability of Lipschitz functions on R“. Ziemer (1989) 
has an exposition. 

Kolmogorov (1955) gave the first statement in Theorem 8.4. The proof of 
that part as given is essentially that of Kolmogorov and Tikhomirov (1959). 
Lorentz (1966, p. 920) sketches another proof. Theorem 8.7 is essentially due 
to Clements (1963, Theorem 3); the proof here is adapted from Dudley (1974, 
Lemmas 3.5, 3.6). Remark 8.11, for which I am very grateful to Joseph Fu, 
shows that there is an error in Dudley (1974, (3.2)), even with the correction 
(1979), which is proved only for a > 1. Theorem 8.13 and its proof by a 
sequence of lemmas are newly corrected and extended versions of results and 
proofs of Dudley (1974). Tze-Gong Sun (1976) first proved Corollary 8.20(a). 
The statement was apparently first published in Dudley (1978, Theorem 5.12) 
and attributed to Sun. There is a related technical report by Sun and Pyke 
(1982). 


Notes to Section 8.3. For Theorem 8.22 and its proof for d > 3 I am grateful 
to Lucien Birgé (1982, personal communication) for the idea of the transfor- 
mation U. For Theorem 8.21 I am much indebted to earlier conversations with 
Mike Steele. Any errors, however, are mine. 

A lower bound for empirical processes on lower layers in the plane will 
be given in Section 11.4 below, where P is uniform on the unit square. For 
other, previous results, e.g. laws of large numbers uniformly over LLa for more 
general P, and on the statistical interest of lower layers (monotone regression), 
see Wright (1981) and references given there. 


Notes to Section 8.4. Bolthausen (1978) proved the Donsker property of the 
class of convex sets in the plane for the uniform law on $I^2$ (cf. Corollary 
8.26(b)). 

Theorem 8.25, as mentioned, is due to Bronštein (1976). Specifically, the 
smoothing $C \mapsto C^2$ and the set of lemmas used follow mainly, though not 
entirely, his original proof. Perhaps most notably, it appears that the polyhedra 
used to approximate convex sets in Bronštein's construction need not be convex 
themselves, and this required some adjustments in the proof. A more minor 
point is that if $C \in \mathcal{C}_d$, then $C^1$ does not necessarily include $B(0, 1)$, although 
$C^2$ does. So Bronštein's Lemma 1 seems incorrect as stated but could be 
repaired by changing various constants. 

In an earlier result of Dudley (1974, Theorem 4.1), for $d \ge 2$, in the upper 
bound $\varepsilon^{(1-d)/2}$ was multiplied by $|\log\varepsilon|$. This bound, weaker than Bronštein's, 
is easier to prove and suffices to give Corollary 8.26(b) and (c). The lower bound 
in Dudley (1974) was reproduced here. I thank James Munkres for telling me 
about the triangulation method used after Lemma 8.32. 

Gruber (1983) surveys other aspects of approximation of convex sets. 
The Two-Sample Case, the Bootstrap, 
and Confidence Sets 


9.1 The Two-Sample Case 


Let $X_1, \dots, X_m, \dots, Y_1, \dots, Y_n, \dots$ be some random variables taking values 
in a set $A$ where $(A, \mathcal{A})$ is a measurable space. Thus $(X_1, \dots, X_m)$ and 
$(Y_1, \dots, Y_n)$ are "samples," of which we have two. Let 

$$P_m := \frac{1}{m}\sum_{i=1}^m \delta_{X_i}, \qquad Q_n := \frac{1}{n}\sum_{j=1}^n \delta_{Y_j}.$$

The object of two-sample tests in statistics is to decide whether $P_m$ and $Q_n$ are 
empirical measures from the same, but unknown, law (probability measure) $P$ 
on $(A, \mathcal{A})$. Since $P$ is unknown, we cannot directly compare $P_m$ or $Q_n$ to it by 
forming $m^{1/2}(P_m - P)$ or $n^{1/2}(Q_n - P)$. Instead, $P_m$ and $Q_n$ can be compared 
to each other, setting 

$$\nu_{m,n} := \Bigl(\frac{mn}{m+n}\Bigr)^{1/2}(P_m - Q_n).$$


The basic hypothesis will be that there are two laws $P$, $Q$ on $(A, \mathcal{A})$ and a 
product of two countable products of copies of $(A, \mathcal{A})$ with factor laws $P$ and 
$Q$ respectively, namely, 

$$(\Omega, \mathcal{B}, \Pr) = (\Omega_1, \mathcal{B}_1, \Pr_1) \times (\Omega_2, \mathcal{B}_2, \Pr_2)$$

where 

$$(\Omega_1, \mathcal{B}_1, \Pr_1) = \prod_{i=1}^\infty (A_i, \mathcal{A}, P), \qquad (\Omega_2, \mathcal{B}_2, \Pr_2) = \prod_{j=1}^\infty (B_j, \mathcal{A}, Q),$$

and each $A_i$ and $B_j$ is a copy of $A$. On these products let $X_i$ be the $A_i$ coordinate 
and $Y_j$ the $B_j$ coordinate. If $\mathcal{P}$ is a class of laws on $(A, \mathcal{A})$, the $(\mathcal{P})$ null 
hypothesis is that in addition, $P = Q \in \mathcal{P}$. A class $\mathcal{F}$ of measurable functions 
on $(A, \mathcal{A})$ will be called a $\mathcal{P}$-universal Donsker class if it is a $P$-Donsker class 
for every $P \in \mathcal{P}$. 


Theorem 9.1 Let $\mathcal{F}$ be a $\mathcal{P}$-universal Donsker class of functions on $(A, \mathcal{A})$. 
Then for each $P \in \mathcal{P}$, under the $(\mathcal{P})$ null hypothesis, $\nu_{m,n} \Rightarrow G_P$ as $m, n \to \infty$. 


Proof. Let $P \in \mathcal{P}$. By Theorem 3.30, it suffices to show that $\beta(\nu_{m,n}, G_P) \to 0$ 
as $m, n \to \infty$. Since $\mathcal{F}$ is $P$-Donsker, it is $P$-pregaussian, so that by Theorem 
3.2 there exists a coherent $G_P$ process on $\mathcal{F}$. Since $\mathcal{F}$ is $\rho_P$-totally bounded, the 
functions $f \mapsto G_P(f)(\omega)$ on $\mathcal{F}$ belong to the space $S_0 := UC(\mathcal{F}, \rho_P)$ of all 
$\rho_P$-uniformly continuous and thus bounded functions on $\mathcal{F}$. Here $UC(\mathcal{F}, \rho_P)$ is 
a separable subspace of the Banach space $(\ell^\infty(\mathcal{F}), \|\cdot\|_{\mathcal{F}})$, because uniformly 
continuous functions on a totally bounded metric space extend uniquely to 
functions in the space $C(K)$ of continuous functions on the compact completion 
$K$, and $C(K)$ is separable in the sup norm for every compact metric space $K$: 
RAP Corollary 11.2.5. The map $\omega \mapsto G_P(\cdot)(\omega)$ is Borel measurable from the 
underlying probability space into $S_0$: to see this, it is enough by second-countability 
(e.g., RAP, Proposition 2.1.4) to show that the inverse image 
of an open ball $\{f : \|f - f_0\|_{\mathcal{F}} < r\}$ is measurable, which is true since on 
$UC(\mathcal{F}, \rho_P)$ the supremum can be taken over a countable $\rho_P$-dense set in $\mathcal{F}$. 
Thus $G_P$ has a law (Borel probability measure) $\mu_0$ on $UC(\mathcal{F}, \rho_P)$, as in 
Theorem 2.32. It is easily seen that every coherent $G_P$ process on $\mathcal{F}$ has the 
same law on $UC(\mathcal{F}, \rho_P)$ (consider finite subsets increasing up to a countable 
$\rho_P$-dense subset of $\mathcal{F}$). 

Since $\mathcal{F}$ is a $P$-Donsker class, by Theorem 3.24 there exist probability 
spaces $(V_i, \mathcal{S}_i, \tau_i)$, $i = 1, 2$, perfect measurable functions $g_{ni}$ from $V_i$ into $\Omega_i$ 
such that $\tau_i \circ g_{ni}^{-1} = \Pr_i$ for all $n$ and $i = 1, 2$, and coherent $G_P$ processes $G_P^{(i)}$ 
on $V_i$ for $i = 1, 2$ such that $m^{1/2}(P_m - P) \circ g_{m1} \to G_P^{(1)}$ almost uniformly for 
$\|\cdot\|_{\mathcal{F}}$ as $m \to \infty$ and $n^{1/2}(Q_n - P) \circ g_{n2} \to G_P^{(2)}$ almost uniformly for $\|\cdot\|_{\mathcal{F}}$ 
as $n \to \infty$. Form the product $(V, \mathcal{S}, \tau) := (V_1, \mathcal{S}_1, \tau_1) \times (V_2, \mathcal{S}_2, \tau_2)$ and 
define $h_{mn} : V \to \Omega_1 \times \Omega_2$ by $h_{mn}(u, v) := (g_{m1}(u), g_{n2}(v))$. Then $\tau \circ h_{mn}^{-1} = 
\Pr_1 \times \Pr_2$ for all $m$ and $n$. To show that $h_{mn}$ is perfect, the proof of Theorem 
3.24, to be called the "1-sample" proof, will be adapted to the 2-sample case. 

Apply Theorem 3.24 with $(A^n, P^n)$ in place of $(X_n, Q_n)$ and with the notation 
$\tau$ instead of $Q$. Take the Cartesian product of two copies of the probability 
space $\Omega$ in the statement, which will be $(V_1, \tau_1) \times (V_2, \tau_2)$. Apply the 
constructions in the 1-sample proof in defining the two spaces, specifically the 
definitions $T_n = A^n \times I_n$ for $n \ge 1$ and $T_0 = S_0 \times I_0$ where each $I_j$ is a copy 
of $[0, 1]$. 

In the last passage of the 1-sample proof, showing that $g_n$ are perfect, for 
given $m$ and $n$ we mainly want to show that $h_{mn}$ is perfect for $m \ge 1$ and $n \ge 1$. 
The case where just one of $m$ or $n$ is 0 is not needed. 
Take a set $B \subset V_1 \times V_2$ with $(\tau_1 \times \tau_2)(B) > 0$. Write points of $V_i = 
\prod_{n \ge 0} T_n$ as $(x, \xi) \in V_1$ and $(y, \eta) \in V_2$ where $x, y \in T_0$ and $\xi, \eta \in T := 
\prod_{n \ge 1} T_n$. Then 


- J / J J Ia, £, Y, Mdp (Edp Mdudu) 


for p, and uo as in the 1-sample proof. Choose and fix an x and y such that 


T f J La(x, E, y, n)dpg(E)doy(n). 


In the one-sample proof write px = Æmx X Bmjx rather than Pj,» X Qmx. Then 
the latter double integral equals 


/ J / J EE Ene E aT a a Bry (1") 


where Em ranges over Tp = A” x Im, &” over [] eer T,, and likewise for nn 
and 7”. Choose and fix €” and 7” such that 


0 < | f latm E" 35 im ddaa Endi 


Let Em = (u, v), u € A”, n, = (w, z), w € A”, z € I,. Recalling that Q, in the 
1-sample proof is here set equal to P”, we get as in that proof 


E I I J J Lp(x, u,v, y ar dd P de 


Choose and fix v and z so that 


0< f f isu, 0,8", y, tnd P™ud PM). 


Let $C := \{(u, w) \in A^m \times A^n : (x, u, v, \xi'', y, w, z, \eta'') \in B\}$. Then $(P^m \times 
P^n)(C) > 0$ and $h_{mn}[B] \supset C$, showing that $h_{mn}$ is perfect by Theorem 3.18(b). 


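We have 

$$\nu_{m,n} = \frac{(mn)^{1/2}}{(m+n)^{1/2}}\bigl(P_m - P - (Q_n - P)\bigr) = \Bigl(\frac{n}{m+n}\Bigr)^{1/2}m^{1/2}(P_m - P) - \Bigl(\frac{m}{m+n}\Bigr)^{1/2}n^{1/2}(Q_n - P).$$

Let $G_P^{(m,n)} := (n/(m+n))^{1/2}G_P^{(1)} - (m/(m+n))^{1/2}G_P^{(2)}$. Now, if $Y$, $Z$ are 
independent coherent $G_P$ processes on $\mathcal{F}$ and $a$, $b$ are any real numbers with 
$a^2 + b^2 = 1$, then $aY + bZ$ is a coherent $G_P$ process on $\mathcal{F}$. To see this, note 
that $aY + bZ$ is a Gaussian process indexed by $\mathcal{F}$, has the covariances of $G_P$, 
and is coherent. So each $G_P^{(m,n)}$ is a coherent $G_P$ process. We have 

$$\bigl\|\nu_{m,n} \circ h_{mn} - G_P^{(m,n)}\bigr\|_{\mathcal{F}} \le \Bigl(\frac{n}{m+n}\Bigr)^{1/2}\bigl\|m^{1/2}(P_m - P) \circ g_{m1} - G_P^{(1)}\bigr\|_{\mathcal{F}} + \Bigl(\frac{m}{m+n}\Bigr)^{1/2}\bigl\|n^{1/2}(Q_n - P) \circ g_{n2} - G_P^{(2)}\bigr\|_{\mathcal{F}} \to 0$$

almost uniformly as $m, n \to \infty$. 
Let $H$ be a function on $\ell^\infty(\mathcal{F})$ with $\|H\|_{BL} \le 1$, so that $\|H\|_{\sup} \le 1$ and 
$|H(f) - H(g)| \le \|f - g\|_{\mathcal{F}}$ for all $f, g \in \ell^\infty(\mathcal{F})$. Given $\varepsilon > 0$, there are $m_0 < 
\infty$ and $n_0 < \infty$ such that for $m \ge m_0$ and $n \ge n_0$, $\|\nu_{m,n} \circ h_{mn} - G_P^{(m,n)}\|_{\mathcal{F}} \le \varepsilon$ 
on a set $V_\varepsilon \subset V$ with $\mu(V_\varepsilon) > 1 - \varepsilon$. It follows that 

$$d_{m,n} := \bigl|H(\nu_{m,n} \circ h_{mn}) - H(G_P^{(m,n)})\bigr| \le \varepsilon$$

on $V_\varepsilon$, and $d_{m,n} \le 2$ everywhere. Since $G_P^{(m,n)}$ has a law defined on the Borel 
sets of $UC(\mathcal{F}, \rho_P)$, thus of $\ell^\infty(\mathcal{F})$, for $\|\cdot\|_{\mathcal{F}}$, $H(G_P^{(m,n)})$ is measurable and 

$$\Bigl|\int^* H(\nu_{m,n} \circ h_{mn})\,d\mu - EH(G_P^{(m,n)})\Bigr| \le 3\varepsilon.$$

It follows that $\beta(\nu_{m,n} \circ h_{mn}, G_P) \to 0$ as $m, n \to \infty$. Since each $h_{mn}$ is perfect 
and measure-preserving, it follows that $\nu_{m,n} \Rightarrow G_P$ for $m, n \to \infty$ as in 
Theorem 3.20. 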


The classical two-sample situation is the special case where $A = \mathbb{R}$, $\mathcal{A}$ is the 
Borel $\sigma$-algebra, $\mathcal{F}$ is the set of all indicator functions of half-lines $(-\infty, x]$, and 
$\mathcal{P}$ is either the set of all laws on $\mathbb{R}$, or the set of all continuous (nonatomic) laws. 
Thus $m^{1/2}(P_m - P)((-\infty, x]) = m^{1/2}(F_m - F)(x)$ where $F$ is the distribution 
function of $P$ and $F_m$ an empirical distribution function. Here $\mathcal{F}$ is a universal 
Donsker class by any of several previous results, for example, Corollary 6.20, 
and $G_P(1_{(-\infty,x]}) = y_{F(x)}$ where $y$ is the Brownian bridge (as in Chapter 1). 
Actually, $\mathcal{F}$ is a uniform Donsker class as defined in Section 10.4. Without 
using any machinery, one can see directly that the convergence to $y_{F(\cdot)}$ is as 
fast for any distribution function $F$ as it is for the $U[0, 1]$ distribution function. 
If $F$ is continuous, it takes all values in the open interval $(0, 1)$, $y_0 = y_1 = 0$, 
and $y_t \to 0$ as $t \downarrow 0$ or $t \uparrow 1$. Thus the distributions of $\sup_x y_{F(x)}$ and $\sup_x |y_{F(x)}|$ 
and the joint distribution of $(\inf_u y_{F(u)}, \sup_v y_{F(v)})$ do not depend on $F$, for $F$ 
continuous. Let $F_m$ and $G_n$ be independent empirical distribution functions for 
the same $F$. Let $H_{mn} := (mn/(m+n))^{1/2}(F_m - G_n)$. By Theorem 3.31 and 
Proposition 3.32, which extend straightforwardly to limits as $m, n \to \infty$, the 
distributions of the supremum, supremum of absolute value, and supremum 
minus infimum of $H_{mn}$ converge to those of the same functionals for $y_t$. Thus 
we get for $u > 0$ (about $u = 0$, see Problem 3): 
Corollary 9.2 If $F_m$ and $G_n$ are independent empirical distribution functions 
for a continuous distribution on $\mathbb{R}$, then for any $u > 0$, 

(a) $\lim_{m,n\to\infty} \Pr(\sup_x H_{mn}(x) > u) = \exp(-2u^2)$, 

(b) $\lim_{m,n\to\infty} \Pr(\sup_x |H_{mn}(x)| > u) = 2\sum_{k=1}^\infty (-1)^{k-1}\exp(-2k^2u^2)$, 

(c) $\lim_{m,n\to\infty} \Pr(\sup_x H_{mn}(x) - \inf_y H_{mn}(y) > u) = 2\sum_{k=1}^\infty (4k^2u^2 - 1) 
\exp(-2k^2u^2)$. 


Proof. The distributions of the given functionals of the Brownian bridge $y_t$ are 
given in RAP, Propositions 12.3.3, 12.3.4, and 12.3.6. All three are continuous 
in $u$ for $u > 0$. Thus convergence follows from RAP, Theorem 9.3.6, adapted 
to limits as $m, n \to \infty$. 
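The three limits in Corollary 9.2 are easy to evaluate numerically. The sketch below truncates the two series after a fixed number of terms, which is an assumption of this illustration (the terms decay like $\exp(-2k^2u^2)$, so few are needed unless $u$ is very small); the function names are hypothetical.

```python
import math

def tail_sup(u):
    """Limit in (a): exp(-2 u^2)."""
    return math.exp(-2.0 * u * u)

def tail_abs_sup(u, terms=100):
    """Limit in (b): 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 u^2), truncated."""
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * u * u)
                     for k in range(1, terms + 1))

def tail_kuiper(u, terms=100):
    """Limit in (c): 2 * sum_{k>=1} (4 k^2 u^2 - 1) exp(-2 k^2 u^2), truncated."""
    return 2.0 * sum((4.0 * k * k * u * u - 1.0) * math.exp(-2.0 * k * k * u * u)
                     for k in range(1, terms + 1))

print(tail_sup(1.0), tail_abs_sup(1.36), tail_kuiper(1.747))
```

At $u = 1.36$ the limit in (b) is close to 0.05, and at $u = 1.747$ the limit in (c) is close to 0.05, so these values of $u$ serve as approximate 5% critical points for the two statistics.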


The quantity $\sup_x H_{mn}(x) - \inf_y H_{mn}(y)$ on the left in (c) is called a Kuiper 
statistic. Suppose we have samples on the unit circle in the plane. The unit circle 
could be written as the set of $(\cos\theta, \sin\theta)$ for $0 \le \theta < 2\pi$ or for $-\pi \le \theta < \pi$. 
Using, for example, the Kolmogorov-Smirnov statistic as in (b), the value of 
the statistic would depend on the choice of range for $\theta$, whereas the Kuiper 
statistic is rotationally invariant (this is left as Problem 1 at the end of the 
chapter). 
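The contrast between the two statistics can be seen on a small example. The following sketch computes both from two samples of angles and then rotates the circle; the function names and the sample values are chosen here purely for illustration and are not from the text.

```python
import bisect
import math

def two_sample_stats(xs, ys):
    """Return (sup_x |H_mn(x)|, sup_x H_mn(x) - inf_y H_mn(y)) for
    H_mn = (mn/(m+n))^(1/2) (F_m - G_n), with F_m, G_n empirical d.f.s."""
    m, n = len(xs), len(ys)
    scale = math.sqrt(m * n / (m + n))
    sx, sy = sorted(xs), sorted(ys)
    diffs = [bisect.bisect_right(sx, t) / m - bisect.bisect_right(sy, t) / n
             for t in sorted(set(xs) | set(ys))]
    # F_m - G_n equals 0 to the left of all data, so sup >= 0 and inf <= 0.
    d_plus = max(max(diffs), 0.0)
    d_minus = min(min(diffs), 0.0)
    return scale * max(d_plus, -d_minus), scale * (d_plus - d_minus)

def rotate(angles, c):
    """Rotate angles in [0, 2*pi) by c, reducing modulo 2*pi."""
    return [(t + c) % (2.0 * math.pi) for t in angles]

xs, ys = [1.0, 2.0], [3.0, 4.0]
print(two_sample_stats(xs, ys))                          # before rotation
print(two_sample_stats(rotate(xs, 3.0), rotate(ys, 3.0)))  # after rotation
```

In this example the rotation moves one point of the second sample across the cut at $\theta = 0$; the Kolmogorov-Smirnov value drops from 1 to 1/2 while the Kuiper value stays equal to 1.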

For multidimensional sample spaces, one in general does not know specific 
statistics which, like those in Corollary 9.2, have given asymptotic distributions 
not depending on P in a large class, such as all nonatomic laws. 


9.2 A Bootstrap Central Limit Theorem in Probability 


Iterating the operation by which we get an empirical measure $P_n$ from a law 
$P$, we form the bootstrap empirical measure $P_n^B$ by sampling $n$ independent 
points whose distribution is the empirical measure $P_n$. The bootstrap was first 
introduced in nonparametric statistics, where the law $P$ is unknown and we 
want to make inferences about it from the observed $P_n$. This can be done by 
way of bootstrap central limit theorems, which say that under some conditions, 
including $n$ large enough, $n^{1/2}(P_n^B - P_n)$ behaves like $n^{1/2}(P_n - P)$ and both 
behave like $G_P$. 

Let $(S, \mathcal{S}, P)$ be a probability space and $\mathcal{F}$ a class of real-valued measurable 
functions on $S$. Let as usual $X_1, X_2, \dots$ be coordinates on a countable product 
of copies of $(S, \mathcal{S}, P)$. Then let $X_{n1}^B, \dots, X_{nn}^B$ be independent with distribution 
$P_n$. Let 

$$P_n^B := \frac{1}{n}\sum_{j=1}^n \delta_{X_{nj}^B}.$$

Then $P_n^B$ will be called a bootstrap empirical measure. More precisely, take 
the product of $(S^\infty, P^\infty)$ with a probability space $(\Omega_B, \Pr^B)$ on which for 
all $n = 1, 2, \dots$, i.i.d. random variables $i_{n1}, \dots, i_{nn}$ are defined with uniform 
distribution on $\{1, \dots, n\}$. Then set $X_{nj}^B := X_{i_{nj}}$. 

A statistician has a data set, represented by a fixed $P_n$, and estimates the 
distribution of $P_n^B$ by repeated resampling from the same $P_n$. So we are interested 
not so much in the unconditional distribution of $P_n^B$ as $P_n$ varies, but 
rather in the conditional distribution of $P_n^B$ given $P_n$ or $(X_1, \dots, X_n)$. Let 
$\nu_n^B := n^{1/2}(P_n^B - P_n)$. 
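The resampling scheme can be sketched in a few lines. The code below is an illustration under stated assumptions: the function name `bootstrap_nu` is hypothetical, indices run over $\{0, \dots, n-1\}$ rather than $\{1, \dots, n\}$, and only a single function $f$ is evaluated rather than a whole class.

```python
import random

def bootstrap_nu(sample, f, rng):
    """One draw of nu_n^B(f) = n^(1/2) (P_n^B f - P_n f), conditional on
    the fixed sample (X_1, ..., X_n)."""
    n = len(sample)
    # i.i.d. indices, uniform on {0, ..., n-1}; X^B_{nj} := sample[idx[j]].
    idx = [rng.randrange(n) for _ in range(n)]
    pn_f = sum(f(x) for x in sample) / n
    pnb_f = sum(f(sample[i]) for i in idx) / n
    return n ** 0.5 * (pnb_f - pn_f)

rng = random.Random(1)
sample = [float(i) for i in range(10)]
draws = [bootstrap_nu(sample, lambda x: x, rng) for _ in range(2000)]
print(sum(draws) / len(draws))  # conditional mean of nu_n^B(f) is 0
```

Repeating the draw many times with the sample held fixed approximates the conditional distribution of $\nu_n^B(f)$ given $X_1, \dots, X_n$, which is the object the limit theorems below describe.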

The limit theorems will be stated in terms of the dual-bounded-Lipschitz 
"metric" $\beta$ of Section 3.6, which metrizes convergence in distribution for not 
necessarily measurable random elements of a possibly nonseparable metric 
space $(S, d)$, to a limit which is a measurable random variable with separable 
range. Let $\beta_{\mathcal{F}}$ be the $\beta$ distance where $d$ is the metric defined by the norm 
$\|\cdot\|_{\mathcal{F}}$. 


Definition. Let $(S, \mathcal{S}, P)$ be a probability space and $\mathcal{F}$ a
class of measurable real-valued functions on $S$. Then the bootstrap central limit
theorem holds in probability (respectively, almost surely) for $P$ and $\mathcal{F}$
if and only if $\mathcal{F}$ is pregaussian for $P$ and $\beta_{\mathcal{F}}(\nu_n^B, G_P)$,
conditional on $X_1, \ldots, X_n$, converges to 0 in outer probability (resp.,
almost uniformly) as $n \to \infty$.


To make the conditioning more explicit, given $X_1, \ldots, X_n$, there are finitely
many possible values for $\nu_n^B$, each on a measurable set in $\Omega_B$. For any
bounded Lipschitz function $H$ on $\ell^\infty(\mathcal{F})$,
$E(H(\nu_n^B) \mid X_1, \ldots, X_n)$ is defined, and $EH(G_P)$ is some number. Thus

$$\sup\{|E(H(\nu_n^B) \mid X_1, \ldots, X_n) - EH(G_P)|:\ \|H\|_{BL} \le 1\}$$

is a function of $X_1, \ldots, X_n$, not necessarily measurable, which according to
the definition converges to 0 in one sense or the other.

Thus, the bootstrap central limit theorem holds in probability for $\mathcal{F}$ if
and only if for any $\varepsilon > 0$,
$E^*(\Pr(\beta_{\mathcal{F}}(\nu_n^B, G_P) > \varepsilon \mid \{X_j\}_{j=1}^n)) \to 0$
as $n \to \infty$, where the outer expectation is taken with respect to the
distribution of $X_1, \ldots, X_n$.

A main bootstrap limit theorem in probability will be stated. It will be proved 
in the rest of this section, via a number of lemmas and auxiliary theorems. 


Theorem 9.3 (Giné and Zinn) Let (X, A, P) be any probability space and 
F a Donsker class for P. Then the bootstrap central limit theorem holds in 
probability for P and F. 
Remarks. Giné and Zinn (1990), see also Giné (1997), also proved "only if"
under a measurability condition and proved a corresponding almost sure form
of the theorem where $\mathcal{F}$ has an $L^2$ envelope up to additive constants.


Proof. $E^B$, $\Pr^B$, and $\mathcal{L}^B$ will denote the conditional expectation,
probability, and law respectively given the sample $X^{(n)} := (X_1, \ldots, X_n)$.
Given the sample, $\nu_n^B$ has only finitely many possible values.

First, a finite-dimensional bootstrap central limit theorem is needed. 


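Theorem 9.4 Let $X_1, X_2, \ldots$ be i.i.d. random variables with values in
$\mathbb{R}^d$ and let $X^B_{n,i}$, $i = 1, \ldots, n$, be i.i.d. $(P_n)$, where
$P_n := \frac{1}{n}\sum_{j=1}^n \delta_{X_j}$. Let
$\bar X_n := \frac{1}{n}\sum_{j=1}^n X_j$. Assume that $E|X_1|^2 < \infty$. Let $C$
be the covariance matrix of $X_1$,
$C_{rs} := E(X_{1r}X_{1s}) - E(X_{1r})E(X_{1s})$, $r, s = 1, \ldots, d$. Then for
the usual convergence of laws in $\mathbb{R}^d$, almost surely as $n \to \infty$,

$$\mathcal{L}^B\Big(n^{-1/2}\sum_{j=1}^n (X^B_{n,j} - \bar X_n)\Big) \to N(0, C). \tag{9.1}$$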


Proof. Note: $N(0, 0)$ is defined as a point mass at 0. Suppose the theorem holds
for $d = 1$. Then for each $t \in \mathbb{R}^d$, the theorem holds for the variables
$(t, X_j)$ and thus $(t, X^B_{n,j})$ and so on. This implies that the characteristic
functions of the laws on the left in (9.1) converge pointwise to that of $N(0, C)$,
and thus that the laws converge by the Lévy continuity theorem (RAP, Theorem 9.8.2).
(This last argument is sometimes called the "Cramér-Wold" device.) So, we
can assume $d = 1$, and on the right in (9.1) we have a law $N(0, \sigma^2)$ where
$\sigma^2$ is the variance of $X_1$. If $\sigma^2 = 0$ there is no problem, so assume
$\sigma^2 > 0$. We can assume that $EX_1 = 0$ since subtracting $EX_1$ from all $X_j$
does not change the expression on the left in (9.1). Let

$$s_n^2 := \frac{1}{n}\sum_{j=1}^n (X_j - \bar X_n)^2 = \frac{1}{n}\sum_{j=1}^n X_j^2 - \bar X_n^2, \qquad s_n := (s_n^2)^{1/2}.$$

Then by the strong law of large numbers for the $X_j$ and the $X_j^2$, $s_n^2$
converges a.s. as $n \to \infty$ to $\sigma^2$.

For a given $n$ and sample $X^{(n)}$, $\bar X_n$ and $s_n$ are fixed and the variables
$Y_{ni} := (X^B_{n,i} - \bar X_n)/n^{1/2}$ for $i = 1, \ldots, n$ are i.i.d. We have a
triangular array and will apply the Lindeberg theorem (RAP, Theorem 9.6.1). For each
$n$ and $i$, $E^B Y_{ni} = 0$. Each $Y_{ni}$ has variance
$\mathrm{Var}^B(Y_{ni}) = s_n^2/n$. The sum from $i = 1$ to $n$ of these variances is
$s_n^2$. Let $Z_{ni} := Y_{ni}/s_n$. Then $Z_{ni}$ remain i.i.d. given $P_n$, and the
sum of their variances is 1. We would like to show by the Lindeberg theorem that
$\mathcal{L}^B(\sum_{i=1}^n Z_{ni})$ converges to $N(0, 1)$ as $n \to \infty$, for
almost all $X_1, X_2, \ldots$. We have $E^B(Z_{ni}) = 0$. Let
$1\{\ldots\} := 1_{\{\ldots\}}$ for any event $\{\ldots\}$. For $\varepsilon > 0$ let
$E_{ni\varepsilon} := E^B(Z_{ni}^2 1\{|Z_{ni}| > \varepsilon\})$. It remains to show
that $\sum_{i=1}^n E_{ni\varepsilon} = nE_{n1\varepsilon} \to 0$ as $n \to \infty$,
almost surely. Now

$$A_{n1} := \{|Z_{n1}| > \varepsilon\} = \{|X^B_{n,1} - \bar X_n| > n^{1/2}s_n\varepsilon\},$$

which for $n$ large enough, almost surely, is included in the set

$$C(n) := C_n := \{|X^B_{n,1}| > n^{1/2}\sigma\varepsilon/2\}.$$

Also, $Z_{n1}^2 = (X^B_{n,1} - \bar X_n)^2/(ns_n^2) \le 2[(X^B_{n,1})^2 + \bar X_n^2]/(ns_n^2)$.
Then $\bar X_n^2/s_n^2$, which is constant given the sample, approaches 0 a.s. as
$n \to \infty$, and so does
$E^B(1_{C(n)}\bar X_n^2/s_n^2) \le \bar X_n^2/s_n^2$. For the previous term,

$$E^B\big((X^B_{n,1})^2 1_{C(n)}\big)/s_n^2 = n^{-1}\sum_{j=1}^n X_j^2\, 1\{|X_j| > n^{1/2}\sigma\varepsilon/2\}/s_n^2,$$

which goes to 0 a.s. as $n \to \infty$, by the strong law of large numbers for the
variables $X_j^2 1\{|X_j| > K\}$ for any fixed $K$, then letting $K \to \infty$.
Theorem 9.4 is proved.
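
For $d = 1$, the two conditional quantities used in this proof can be computed exactly from a given sample, since $X^B_{n,1}$ takes each observed value with probability $1/n$. The sketch below is only illustrative: the function names and the toy sample are hypothetical, and it assumes $s_n > 0$.

```python
import math

def bootstrap_variance(sample):
    """Conditional variance, given the sample, of
    n^{-1/2} * sum_j (X^B_{n,j} - Xbar_n); this equals s_n^2."""
    n = len(sample)
    xbar = sum(sample) / n
    # each Y_ni has conditional variance s_n^2 / n, and n of them are summed
    return sum((x - xbar) ** 2 for x in sample) / n

def lindeberg_sum(sample, eps):
    """n * E^B( Z_n1^2 1{|Z_n1| > eps} ), Z_n1 = (X^B_n1 - Xbar_n)/(n^{1/2} s_n).
    Assumes s_n > 0."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(bootstrap_variance(sample))
    total = 0.0
    for x in sample:                     # X^B_{n,1} = x with probability 1/n
        z = (x - xbar) / (math.sqrt(n) * s)
        if abs(z) > eps:
            total += z * z / n
    return n * total

sample = [1.0, 2.0, 3.0, 4.0]            # hypothetical data
variance = bootstrap_variance(sample)    # s_n^2 for this sample
```

With `eps = 0` and no sample point equal to the mean, `lindeberg_sum` returns the total variance 1 of the normalized array; the Lindeberg condition asks that for each fixed `eps > 0` the value tends to 0 as the sample grows.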


Here is what is called a desymmetrization fact: 


Lemma 9.5 Let $T$ be a set and for any real-valued function $f$ on $T$ let
$\|f\|_T := \sup_{t \in T}|f(t)|$. Let $X$ and $Y$ be two stochastic processes
indexed by $t \in T$ defined on a probability space
$(\Omega \times \Omega', \mathcal{S} \otimes \mathcal{S}', P \times P')$, where
$X(t)(\omega, \omega')$ depends only on $\omega \in \Omega$ and
$Y(t)(\omega, \omega')$ only on $\omega' \in \Omega'$. Then

(a) For any $s > 0$ and any $u > 0$ such that
$\sup_{t \in T}\Pr\{|Y(t)| \ge u\} < 1$, we have

$$P^*(\|X\|_T > s) \le \Pr{}^*\{\|X - Y\|_T > s - u\}\big/\big[1 - \sup_{t \in T}P'(|Y(t)| \ge u)\big].$$

(b) If $\theta > \sup_{t \in T}E(Y(t)^2)$, then for any $s > 0$,

$$P^*(\|X\|_T > s) \le 2\Pr{}^*(\|X - Y\|_T > s - (2\theta)^{1/2}).$$


Proof. Let $X(\omega)(t) := X(t)(\omega)$ and
$A(s) := \{\omega:\ \|X(\omega)\|_T > s\}$. For part (a) we have by (3.3) in
Theorem 3.9

$$\begin{aligned}
\Pr{}^*(\|X - Y\|_T > s - u) &\ge E_P^*\,(P')^*(\|X - Y\|_T > s - u) \\
&\ge P^*(\|X\|_T > s)\inf_{\omega \in A(s)}(P')^*(\|X(\omega) - Y\|_T > s - u) \\
&\ge P^*(\|X\|_T > s)\inf_{t \in T}P'(|Y(t)| < u),
\end{aligned}$$

since if $\|X(\omega)\|_T > s$, then for some $t \in T$, $|X(\omega)(t)| > s$. This
proves part (a). Then part (b) follows by Chebyshev's inequality.


Recall that $\varepsilon_i$ are called Rademacher variables if
$\Pr(\varepsilon_i = 1) = \Pr(\varepsilon_i = -1) = 1/2$. Some hypotheses will be
given for later reference. Let $(S, \mathcal{S}, P)$ be a probability space and
$(S^n, \mathcal{S}^n, P^n)$ a Cartesian product of $n$ copies of
$(S, \mathcal{S}, P)$. Let $\mathcal{F} \subset \mathcal{L}^2(S, \mathcal{S}, P)$.
Let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher variables
defined on a probability space $(\Omega', \mathcal{A}, P')$. Then take the
probability space

$$(S^n \times \Omega',\ \mathcal{S}^n \otimes \mathcal{A},\ P^n \times P'). \tag{9.2}$$

References to (9.2) will also be to the preceding paragraph.
Here is another fact on symmetrization and desymmetrization. 


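Lemma 9.6 Under (9.2), for any $t > 0$ and $n = 1, 2, \ldots$,

(a) $\Pr^*\big(\|\sum_{i=1}^n \varepsilon_i f(X_i)\|_{\mathcal{F}} > t\big) \le 2\max_{k \le n}\Pr^*\big(\|\sum_{i=1}^k f(X_i)\|_{\mathcal{F}} > t/2\big)$.

(b) Suppose that $\alpha^2 := \sup_{f \in \mathcal{F}}\int (f - Pf)^2\,dP < \infty$.
Then for $t > 2^{1/2}\alpha n^{1/2}$ and all $n = 1, 2, \ldots$,

$$\Pr{}^*\Big(\Big\|\sum_{i=1}^n (f(X_i) - Pf)\Big\|_{\mathcal{F}} > t\Big) \le 4\Pr{}^*\Big(\Big\|\sum_{i=1}^n \varepsilon_i f(X_i)\Big\|_{\mathcal{F}} > \big(t - (2n)^{1/2}\alpha\big)/2\Big).$$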

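Proof. For part (a), let
$E(n) := \{\tau := \{\tau_i\}_{i=1}^n:\ \tau_i = \pm 1 \text{ for each } i\}$. Let
$\varepsilon_n := \{\varepsilon_i\}_{i=1}^n$. Then

$$\Pr{}^*\Big(\Big\|\sum_{i=1}^n \varepsilon_i f(X_i)\Big\|_{\mathcal{F}} > t\Big) \le \sum_{\tau \in E(n)}\Pr{}^*\Big(\{\varepsilon_n = \tau\} \cap \Big\{\Big\|\sum_{\tau_i = 1} f(X_i) - \sum_{\tau_i = -1} f(X_i)\Big\|_{\mathcal{F}} > t\Big\}\Big) =: P_1.$$

By Lemma 3.6 and Theorem 3.3, for laws $\mu$, $\nu$ and sets $A$, $B$,
$(\mu \times \nu)^*(A \times B) = \mu^*(A)\nu^*(B) = \mu(A)\nu^*(B)$
if $A$ is measurable, so

$$P_1 \le 2\sum_{\tau \in E(n)} 2^{-n}\max_{k \le n}(P^n)^*\Big(\Big\|\sum_{i=1}^k f(X_i)\Big\|_{\mathcal{F}} > t/2\Big) = 2\max_{k \le n}\Pr{}^*\Big(\Big\|\sum_{i=1}^k f(X_i)\Big\|_{\mathcal{F}} > t/2\Big).$$

This proves (a). For (b), take a further product, so that we have (9.2) with $2n$
in place of $n$. Apply Lemma 9.5(b) to the process $f \mapsto \sum_{i=1}^n f(X_i)$,
letting $Y_i := X_{n+i}$ for $i = 1, \ldots, n$. We get for any $\tau \in E(n)$,

$$\begin{aligned}
\Pr{}^*\Big(\Big\|\sum_{i=1}^n (f(X_i) - Pf)\Big\|_{\mathcal{F}} > t\Big)
&\le 2(P^{2n})^*\Big(\Big\|\sum_{i=1}^n (f(X_i) - f(Y_i))\Big\|_{\mathcal{F}} > t - (2n)^{1/2}\alpha\Big) \\
&= 2(P^{2n})^*\Big(\Big\|\sum_{i=1}^n \tau_i(f(X_i) - f(Y_i))\Big\|_{\mathcal{F}} > t - (2n)^{1/2}\alpha\Big).
\end{aligned}$$

Then by Theorem 3.9, the left side is

$$\begin{aligned}
&\le 2\Pr{}^*\Big(\Big\|\sum_{i=1}^n \varepsilon_i(f(X_i) - f(Y_i))\Big\|_{\mathcal{F}} > t - (2n)^{1/2}\alpha\Big) \\
&\le 4\Pr{}^*\Big(\Big\|\sum_{i=1}^n \varepsilon_i f(X_i)\Big\|_{\mathcal{F}} > \big(t - (2n)^{1/2}\alpha\big)/2\Big),
\end{aligned}$$

finishing the proof.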


Next, we have some consequences or forms of Jensen’s inequality. 


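Lemma 9.7 (a) Let $(S, \mathcal{S}, P)$ be a probability space and
$\mathcal{F} \subset \mathcal{L}^1(S, \mathcal{S}, P)$.
Then $\|Ef\|_{\mathcal{F}} \le E\|f\|^*_{\mathcal{F}}$.

(b) Let $T$ be a set and $(\Omega', \mathcal{S}', Q')$ and
$(\Omega'', \mathcal{S}'', Q'')$ two probability spaces. Take the product probability
space $(\Omega' \times \Omega'', \mathcal{S}' \otimes \mathcal{S}'', Q' \times Q'')$.
Let $g(t, \omega')$ be a real-valued stochastic process indexed by $t \in T$ with
$\omega' \in \Omega'$, and let $h(t, \omega'')$ be another such process with
$\omega'' \in \Omega''$. For any real-valued function $f$ on $T$ let
$\|f\|_T := \sup_{t \in T}|f(t)|$. Assume that $E_{Q''}(h(t, \cdot)) = 0$ for all
$t \in T$. Then for $1 \le p < \infty$,

$$E^*\|g\|_T^p \le E^*\|g + h\|_T^p. \tag{9.3}$$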


Proof. (a) For each $g \in \mathcal{F}$, by Jensen's inequality (e.g., RAP, 10.2.6),
$|Eg| \le E|g|$, while $E|g| \le E\|f\|^*_{\mathcal{F}}$. Then, taking the supremum
over $g \in \mathcal{F}$ proves (a).

For (b), recall that for any function of $\omega'$, $E^*$ for $Q'$ and
$Q' \times Q''$ are the same by Lemma 3.6 and Theorem 3.3. For any values of $t$ and
$\omega'$, $z \mapsto |g(t, \omega') + z|^p$ is a convex function of $z$. By Jensen's
inequality again,

$$\begin{aligned}
|g(t, \omega')|^p = |g(t, \omega') + Eh(t, \cdot)|^p
&\le \int |g(t, \omega') + h(t, \omega'')|^p\,dQ''(\omega'') \\
&\le \sup_{s \in T}\int |g(s, \omega') + h(s, \omega'')|^p\,dQ''(\omega''),
\end{aligned}$$

and so

$$\|g(\cdot, \omega')\|_T^p \le \sup_{s \in T}\int |g(s, \omega') + h(s, \omega'')|^p\,dQ''(\omega'').$$

For all $s \in T$,
$|g(s, \omega') + h(s, \omega'')|^p \le \|g(\cdot, \omega') + h(\cdot, \omega'')\|_T^{*p}$, so

$$E^*_{Q'}\|g\|_T^p \le E_{Q'}E_{Q''}\|g + h\|_T^{*p} = E^*\|g + h\|_T^p$$

by the Tonelli-Fubini theorem since $\|g + h\|_T^{*p}$ is measurable and by Theorem
3.3 ($E^*f = Ef^* \le +\infty$, $f \ge 0$).


The following lemma and theorem are known as Hoffmann-Jørgensen
inequalities.


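Lemma 9.8 Let $(S^n, \mathcal{S}^n, \prod_{j=1}^n P_j)$ be a product probability
space with coordinates $X_1, \ldots, X_n$. Let $\mathcal{F}$ be a class of measurable
real-valued functions on $(S, \mathcal{S})$. Let $S_k(f) := \sum_{i=1}^k f(X_i)$ for
$k = 1, \ldots, n$. Then for any $s > 0$ and $t > 0$,

$$\Pr{}^*\Big(\max_{k \le n}\|S_k\|_{\mathcal{F}} > 3t + s\Big) \le \Big(\Pr{}^*\Big\{\max_{k \le n}\|S_k\|_{\mathcal{F}} > t\Big\}\Big)^2 + \Pr{}^*\Big(\max_{j \le n}\|X_j\|_{\mathcal{F}} > s\Big). \tag{9.4}$$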

Proof. Let $\|\cdot\| := \|\cdot\|_{\mathcal{F}}$. Let
$\tau := \min\{j \le n:\ \|S_j\|^* > t\}$ or $\tau := n + 1$ if there is no such
$j \le n$. Then for $j = 1, \ldots, n$, $\{\tau = j\}$ is by Lemma 3.6
a measurable function of $X_1, \ldots, X_j$ and $\{\max_{k \le n}\|S_k\|^* > t\}$ is
the disjoint union $\bigcup_{j=1}^n\{\tau = j\}$. On $\{\tau = j\}$, we have
$\|S_k\|^* \le t$ if $k < j$ and when $k \ge j$,
$\|S_k\|^* \le t + \|X_j\|^* + \|S_k - S_j\|^*$. Thus in either case,

$$\max_{k \le n}\|S_k\|^* \le t + \max_{i \le n}\|X_i\|^* + \max_{\tau < k \le n}\|S_k - S_\tau\|^*.$$

For $j < k \le n$, by Lemma 3.6, $\|S_k - S_j\|^*$ is (a.s. equal to) a measurable
function of $X_{j+1}, \ldots, X_k$ and thus is independent of $\{\tau = j\}$. So

$$\Pr\Big(\tau = j,\ \max_{k \le n}\|S_k\|^* > 3t + s\Big) \le \Pr\Big(\tau = j,\ \max_{i \le n}\|X_i\|^* > s\Big) + \Pr(\tau = j)\Pr\Big(\max_{j < k \le n}\|S_k - S_j\|^* > 2t\Big).$$

Since $\max_{j < k \le n}\|S_k - S_j\|^* \le 2\max_{k \le n}\|S_k\|^*$, summing over
$j = 1, \ldots, n$ gives (9.4), and the Lemma is proved.
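
As a sanity check of (9.4) in the simplest measurable case, both sides can be computed exactly by enumeration when the summands are $\pm 1$ with equal probability and $\mathcal{F}$ consists of the identity function alone. This special case, and the function name `hj_sides`, are assumptions made for illustration and are not taken from the text.

```python
from itertools import product
from fractions import Fraction

def hj_sides(n, t, s):
    """Exact left and right sides of (9.4) for S_k = x_1 + ... + x_k,
    x_i = +1 or -1 with probability 1/2 each, and F = {identity}."""
    total = 2 ** n
    lhs_count = big_count = jump_count = 0
    for signs in product((-1, 1), repeat=n):
        partial, max_abs = 0, 0
        for x in signs:
            partial += x
            max_abs = max(max_abs, abs(partial))
        lhs_count += max_abs > 3 * t + s   # event on the left of (9.4)
        big_count += max_abs > t           # event that is squared on the right
        jump_count += 1 > s                # max_j |x_j| equals 1 on every path
    p = Fraction(big_count, total)
    return Fraction(lhs_count, total), p * p + Fraction(jump_count, total)
```

For instance, `hj_sides(10, 1, 1)` returns the two sides for ten steps; exact fractions are used so the comparison involves no rounding.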


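Theorem 9.9 Let $0 < p < \infty$, $n = 1, 2, \ldots$, let $X_1, \ldots, X_n$ be
coordinates on a product probability space
$(S^n, \mathcal{S}^n, \prod_{j=1}^n P_j)$. Let $\mathcal{F}$ be a class of measurable
real-valued functions on $(S, \mathcal{S})$ such that for $i = 1, \ldots, n$,
$E\|f(X_i)\|_{\mathcal{F}}^{*p} < \infty$. Let

$$u := \inf\Big\{t > 0:\ \Pr{}^*\Big[\max_{k \le n}\Big\|\sum_{i=1}^k f(X_i)\Big\|_{\mathcal{F}} > t\Big] \le \frac{1}{2 \cdot 4^p}\Big\}.$$

Then

$$E\max_{k \le n}\Big\|\sum_{i=1}^k f(X_i)\Big\|_{\mathcal{F}}^{*p} \le 2 \cdot 4^p\,E\Big(\max_{j \le n}\|f(X_j)\|_{\mathcal{F}}^{*p}\Big) + 2(4u)^p.$$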



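Proof. Recall that for any random variable $Y \ge 0$ with law $P$,

$$\int_0^\infty P(Y > t)\,dt = \int_0^\infty\int_{(t,\infty)} dP(y)\,dt = \int_0^\infty\int_0^y dt\,dP(y) = \int_0^\infty y\,dP(y) = EY.$$

Let $\|\cdot\| := \|\cdot\|_{\mathcal{F}}$. Then

$$\begin{aligned}
E\Big(\max_{k \le n}\|S_k\|^{*p}\Big) &= 4^p\int_0^\infty \Pr\Big(\max_{k \le n}\|S_k\|^* > 4t\Big)\,dt^p \\
&= 4^p\Big(\int_0^u + \int_u^\infty\Big)\Pr\Big(\max_{k \le n}\|S_k\|^* > 4t\Big)\,dt^p,
\end{aligned}$$

which by (9.4) with $s = t$ is less than or equal to

$$\begin{aligned}
&(4u)^p + 4^p\int_u^\infty\Big(\Pr\Big(\max_{k \le n}\|S_k\|^* > t\Big)\Big)^2 dt^p + 4^p\int_u^\infty \Pr\Big(\max_{i \le n}\|X_i\|^* > t\Big)\,dt^p \\
&\quad\le (4u)^p + 4^p\Pr\Big(\max_{k \le n}\|S_k\|^* > u\Big)\int_0^\infty \Pr\Big(\max_{k \le n}\|S_k\|^* > t\Big)\,dt^p + 4^pE\max_{i \le n}\|X_i\|^{*p} \\
&\quad\le (4u)^p + \frac{1}{2}E\Big(\max_{k \le n}\|S_k\|^{*p}\Big) + 4^pE\Big(\max_{i \le n}\|X_i\|^{*p}\Big),
\end{aligned}$$

since $4^p\Pr(\max_{k \le n}\|S_k\|^* > u) \le 1/2$ by choice of $u$. Then since
$E(\max_{k \le n}\|S_k\|^{*p}) < \infty$, we can subtract the term with factor $1/2$
from both sides, then multiply by 2 and get

$$E\Big(\max_{k \le n}\|S_k\|^{*p}\Big) \le 2(4u)^p + 2 \cdot 4^pE\Big(\max_{i \le n}\|X_i\|^{*p}\Big).$$

The theorem is proved.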


Next is another symmetrization-desymmetrization inequality: 
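Lemma 9.10 Under (9.2),

$$\frac{1}{2}E^*\Big\|\sum_{j=1}^n \varepsilon_j(f(X_j) - Pf)\Big\|_{\mathcal{F}} \le E^*\Big\|\sum_{j=1}^n (f(X_j) - Pf)\Big\|_{\mathcal{F}} \le 2E^*\Big\|\sum_{j=1}^n \varepsilon_j(f(X_j) - Pf)\Big\|_{\mathcal{F}}, \tag{9.5}$$

which also holds if the $Pf$ in the last expression is deleted.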
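Proof. Replacing $\mathcal{F}$ by $\{f - Pf:\ f \in \mathcal{F}\}$, we can assume
$Pf = 0$ for all $f \in \mathcal{F}$. Then by (3.4),

$$\begin{aligned}
E^*\Big\|\sum_{j=1}^n \varepsilon_j f(X_j)\Big\|_{\mathcal{F}}
&= E_\varepsilon E_X^*\Big\|\sum_{i:\,\varepsilon_i = 1,\ i \le n} f(X_i) - \sum_{i:\,\varepsilon_i = -1,\ i \le n} f(X_i)\Big\|_{\mathcal{F}} \\
&\le E_\varepsilon E_X^*\Big\|\sum_{i:\,\varepsilon_i = 1,\ i \le n} f(X_i)\Big\|_{\mathcal{F}} + E_\varepsilon E_X^*\Big\|\sum_{i:\,\varepsilon_i = -1,\ i \le n} f(X_i)\Big\|_{\mathcal{F}} \\
&\le 2E_\varepsilon E_X^*\Big\|\sum_{j=1}^n f(X_j)\Big\|_{\mathcal{F}} = 2E^*\Big\|\sum_{j=1}^n f(X_j)\Big\|_{\mathcal{F}}
\end{aligned}$$

by (9.3). The left side of (9.5) follows. For the right side, extend (9.2) by
taking the Cartesian product with another copy of $(S^n, \mathcal{S}^n, P^n)$ and let
the new coordinate functions be $\xi_1, \ldots, \xi_n$. Then if $\tau_i = \pm 1$ for
each $i$,

$$E^*\Big\|\sum_{j=1}^n f(X_j)\Big\|_{\mathcal{F}} \le E^*\Big\|\sum_{j=1}^n (f(X_j) - f(\xi_j))\Big\|_{\mathcal{F}} = E^*\Big\|\sum_{j=1}^n \tau_j(f(X_j) - f(\xi_j))\Big\|_{\mathcal{F}}$$

since (9.3) holds as well if $+$ is replaced by $-$, and by symmetry of
$f(X_j) - f(\xi_j)$. So by (3.3),

$$\begin{aligned}
E^*\Big\|\sum_{i=1}^n f(X_i)\Big\|_{\mathcal{F}}
&\le E_\varepsilon E_{X,\xi}^*\Big\|\sum_{i=1}^n \varepsilon_i(f(X_i) - f(\xi_i))\Big\|_{\mathcal{F}} \\
&\le 2E_\varepsilon E_X^*\Big\|\sum_{i=1}^n \varepsilon_i f(X_i)\Big\|_{\mathcal{F}} = 2E^*\Big\|\sum_{i=1}^n \varepsilon_i f(X_i)\Big\|_{\mathcal{F}},
\end{aligned}$$

which finishes the proof of the Lemma.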
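
In a finite setting where every quantity is measurable, the two-sided bound (9.5) can be checked by exact enumeration. The sketch below assumes $X_i$ uniform on a small finite set and a finite class of integer-valued functions; these choices and the function name `symmetrization_terms` are hypothetical and serve only as an illustration.

```python
from itertools import product
from fractions import Fraction

def symmetrization_terms(points, funcs, n):
    """Return (E||sum eps_i (f(X_i) - Pf)||_F, E||sum (f(X_i) - Pf)||_F)
    exactly, for X_i i.i.d. uniform on `points` and a finite class `funcs`."""
    m = len(points)
    centered = []                      # values f(x) - Pf as exact fractions
    for f in funcs:
        vals = [Fraction(f(x)) for x in points]
        mean = sum(vals) / m
        centered.append([v - mean for v in vals])
    plain = Fraction(0)
    sym = Fraction(0)
    for idx in product(range(m), repeat=n):          # all samples (X_1..X_n)
        plain += max(abs(sum(c[i] for i in idx)) for c in centered)
        for eps in product((-1, 1), repeat=n):       # all sign vectors
            sym += max(abs(sum(e * c[i] for e, i in zip(eps, idx)))
                       for c in centered)
    return sym / (m ** n * 2 ** n), plain / m ** n

sym, plain = symmetrization_terms([0, 1, 2], [lambda x: x, lambda x: x * x], 3)
```

Here `sym` is the Rademacher-symmetrized expectation and `plain` the unsymmetrized one, so (9.5) reads `sym / 2 <= plain <= 2 * sym`.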


Next will be some Poissonization facts. A stochastic process $\xi(t, \omega)$ with
values in $\mathbb{R}^n$ is called centered if $E\xi(t) = 0$ for all $t \in T$.
Recall that $Y$ has a Poisson distribution with parameter $\lambda > 0$ if and only
if $\Pr(Y = k) = e^{-\lambda}\lambda^k/k!$ for $k = 0, 1, 2, \ldots$.


Lemma 9.11 Let $T$ be any set. Let $\xi := \{X_{i,1}\}_{i=1}^n(t, \omega_1)$,
$t \in T$, be a centered stochastic process with values in $\mathbb{R}^n$ with
$\omega_1 \in \Omega_1$ for a probability space $(\Omega_1, \mathcal{S}_1, Q_1)$. For
each $j \ge 1$ let $(\Omega_j, \mathcal{S}_j, Q_j)$ be a copy of
$(\Omega_1, \mathcal{S}_1, Q_1)$. Take the product
$(\Omega, \mathcal{S}, Q) = \prod_{j=1}^\infty(\Omega_j, \mathcal{S}_j, Q_j)$. Let
$\pi_j(\omega) := \omega_j$ be the coordinate projection from $\Omega$ onto
$\Omega_j$. Define stochastic processes $\{X_{i,j}\}_{i=1}^n$ for $j \ge 2$ by
$X_{i,j}(\omega) := X_{i,1}(\pi_j(\omega))$ for $i = 1, \ldots, n$ and all $j \ge 1$.

Let $N_i := N(i)$, $i = 1, 2, \ldots$, be i.i.d. Poisson variables with parameter
$\lambda = 1$, defined on a probability space $(\Omega', P')$, and take the product of
this space with $\Omega$, so that $N_1, N_2, \ldots$ are independent of $X_{i,j}$. Let
$\|f\|_T := \sup_{t \in T}|f(t)|$ and $x \wedge y := \min(x, y)$. Then

$$E^*\Big\|\sum_{i=1}^n X_{i,1}\Big\|_T \le \frac{e}{e - 1}\,E^*\Big\|\sum_{i=1}^n\sum_{j=1}^{N(i)} X_{i,j}\Big\|_T. \tag{9.6}$$

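Proof. By Jensen's inequality, Lemma 9.7(a), letting $E_N$ be expectation with
respect to the Poisson variables $N_i$, and since $E(N_i \wedge 1) = (e - 1)/e$, we have

$$M_n := \frac{e - 1}{e}\,E^*\Big\|\sum_{i=1}^n X_{i,1}\Big\|_T = E_X^*\Big\|\sum_{i=1}^n E_N(N_i \wedge 1)X_{i,1}\Big\|_T \le E_X^*E_N\Big\|\sum_{i=1}^n (N_i \wedge 1)X_{i,1}\Big\|_T.$$

Then by (3.4), $M_n \le E_NE_X^*\|\sum_{i=1}^n (N_i \wedge 1)X_{i,1}\|_T$. For each $i$
and given $N_i = N(i)$,

$$\sum_{j=1}^{N(i)} X_{i,j} - (N_i \wedge 1)X_{i,1} = \sum_{j=2}^{N(i)} X_{i,j}$$

($= 0$ for $N_i \le 1$), which is a centered process independent of
$(N_i \wedge 1)X_{i,1}$. For any fixed values of $N_1, \ldots, N_n$, apply another
form of Jensen's inequality, Lemma 9.7(b), with $p = 1$, $\Omega' := \Omega_1$,
$\Omega'' := \prod_{j \ge 2}\Omega_j$,
$g(t, \omega') := \sum_{i=1}^n (N_i \wedge 1)X_{i,1}(t)$, and
$h(t, \omega'') := \sum_{i=1}^n\sum_{j=2}^{N(i)} X_{i,j}(t)$. We get
$M_n \le E_NE_X^*\|\sum_{i=1}^n\sum_{j=1}^{N(i)} X_{i,j}\|_T$. Then by (3.4),
$E_NE_X^*$ can be replaced by $E^*$, and the Lemma is proved.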


For any two finite signed measures $\mu$ and $\nu$ on the Borel sets of a separable
Banach space $B$ recall that the convolution $\mu * \nu$ is defined by
$(\mu * \nu)(A) := \int \mu(A - x)\,d\nu(x)$ for any Borel set $A$. Here convolution
is commutative and associative. For any finite signed measure $\mu$ on $B$ and
$k = 1, 2, \ldots$, let $\mu^k$ be the $k$th convolution power $\mu * \cdots * \mu$
to $k$ factors. Let $e^\mu := \exp(\mu) := \sum_{k=0}^\infty \mu^k/k!$. Let
$\mu^0 := \delta_0$. If $\mu \ge 0$ let $\mathrm{Pois}(\mu) := e^{-\mu(B)}e^\mu$. If
$\mu$ and $\nu$ are two finite measures on $B$, it is straightforward to check that
$\mathrm{Pois}(\mu + \nu) = \mathrm{Pois}(\mu) * \mathrm{Pois}(\nu)$. If $X$ is a
measurable function from a probability space $(\Omega, \mathcal{S}, P)$ into a
measurable space $(S, \mathcal{A})$, recall that the law of $X$ is the image measure
$\mathcal{L}(X) := P \circ X^{-1}$ on $\mathcal{A}$. For any $c > 0$ and $x \in B$,
$\mathrm{Pois}(c\delta_x) = \mathcal{L}(N_cx)$ where $N_c$ is a Poisson random
variable with parameter $c$.

If $\mathcal{L}(X_{i,j}) = \mu_i$ for $j = 1, 2, \ldots$ and $N_i := N(i)$ are
Poisson with parameter 1, where all $X_{i,j}$ and $N_r$ are jointly independent,
$i, r = 1, \ldots, k$, then

$$\mathcal{L}\Big(\sum_{i=1}^k\sum_{j=1}^{N(i)} X_{i,j}\Big) = \mathrm{Pois}\Big(\sum_{i=1}^k \mu_i\Big).$$

If $X_j$ are independent random variables with values in a separable Banach space
with $EX_j = 0$ for each $j = 1, \ldots, n$, then the depoissonization inequality
(9.6) gives

$$E\Big\|\sum_{j=1}^n X_j\Big\| \le \frac{e}{e - 1}\int \|x\|\,d\,\mathrm{Pois}\Big(\sum_{j=1}^n \mathcal{L}(X_j)\Big)(x). \tag{9.7}$$

Here is another depoissonization inequality. 
Lemma 9.12 Let $(B, \|\cdot\|)$ be a normed space. For each $n = 1, 2, \ldots$, let
$v_1, \ldots, v_n$ be $n$ distinct points of $B$ and
$\bar v := (v_1 + \cdots + v_n)/n$. Let $V_1, \ldots, V_n$ be i.i.d. $B$-valued
random variables with $\Pr(V_i = v_j) = 1/n$ for $i, j = 1, \ldots, n$. Let
$N_1, \ldots, N_n$ be Poisson variables with parameter 1. Let all $V_i$ and $N_j$ be
jointly independent. Then

$$E\Big\|\sum_{j=1}^n (V_j - \bar v)\Big\| \le \frac{e}{e - 1}\,E\Big\|\sum_{j=1}^n (N_j - 1)(v_j - \bar v)\Big\|. \tag{9.8}$$


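Proof. The variables $V_i$ and $N_j$ are discrete, so the norms in (9.8) are both
measurable random variables. Next, with $v(j) := v_j$,

$$\sum_{j=1}^n \mathcal{L}(V_j - \bar v) = n\mathcal{L}(V_1 - \bar v) = \sum_{j=1}^n \delta_{v(j) - \bar v}.$$

It follows from the above properties of Poissonization that

$$\begin{aligned}
\mathrm{Pois}\Big(\sum_{j=1}^n \mathcal{L}(V_j - \bar v)\Big) &= \mathrm{Pois}(\delta_{v(1) - \bar v}) * \cdots * \mathrm{Pois}(\delta_{v(n) - \bar v}) \\
&= \mathcal{L}\Big(\sum_{i=1}^n N_i(v_i - \bar v)\Big) = \mathcal{L}\Big(\sum_{i=1}^n (N_i - 1)(v_i - \bar v)\Big).
\end{aligned} \tag{9.9}$$

Then by (9.7) for $X_j = V_j - \bar v$, the Lemma is proved.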
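
For real points $v_1, \ldots, v_n$ the two sides of (9.8) can be compared numerically. The left side is computed exactly by enumerating the $n^n$ outcomes of $(V_1, \ldots, V_n)$; the right side is approximated by truncating each Poisson variable at a cutoff, which can only make it smaller, so the check is conservative. The function names, the cutoff value, and the example points are assumptions made for this sketch.

```python
import math
from itertools import product

def poisson_pmf(k):
    """Pr(N = k) for N Poisson with parameter 1."""
    return math.exp(-1.0) / math.factorial(k)

def depoissonization_sides(v, cutoff=12):
    """Return (left side of (9.8), truncated right side of (9.8))
    for real points v_1, ..., v_n."""
    n = len(v)
    vbar = sum(v) / n
    dev = [x - vbar for x in v]
    # E| sum_j (V_j - vbar) |, with V_j i.i.d. uniform on the n points
    lhs = sum(abs(sum(dev[i] for i in idx))
              for idx in product(range(n), repeat=n)) / n ** n
    # E| sum_j (N_j - 1)(v_j - vbar) |, each N_j restricted to 0..cutoff
    rhs = 0.0
    for counts in product(range(cutoff + 1), repeat=n):
        prob = math.prod(poisson_pmf(k) for k in counts)
        rhs += prob * abs(sum((k - 1) * d for k, d in zip(counts, dev)))
    return lhs, math.e / (math.e - 1.0) * rhs

lhs, rhs = depoissonization_sides([0.0, 1.0, 3.0])
```

The cost grows like `(cutoff + 1) ** n`, so this is practical only for very small `n`.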


The next proof will use the fact that for any real-valued function $f$ on a
measure space, $1_{\{f^* > t\}} = (1_{\{f > t\}})^*$. This is true since
$1_{\{f > t\}} \le 1_{\{f^* > t\}}$, where the latter is measurable, and Lemma 3.8(a)
applies. The next fact is about triangular arrays with i.i.d. summands.


Theorem 9.13 Let $(T, d)$ be a totally bounded pseudo-metric space. For each
$n = 1, 2, \ldots$, suppose given a product of $n$ copies of a probability space,
$(\Omega_n, \mathcal{A}_n, P^{(n)})^n$. For each $n$, let $Y_n$ be a real-valued
stochastic process indexed by $T$ on $\Omega_n$. For
$\omega := \{\omega_j\}_{j=1}^n \in \Omega_n^n$ let
$X_{n,j}(\omega) := Y_n(\omega_j)$. Thus $X_{n,j}$ are i.i.d. copies of $Y_n$.
Suppose that each $Y_n$ has bounded sample paths a.s. Assume that

(i) For all $t$ in a dense subset $D \subset T$ for $d$ and for all $\beta > 0$,

$$\lim_{n \to \infty} n\Pr\{|X_{n,1}(t)| > \beta n^{1/2}\} = 0, \tag{9.10}$$

(ii) For any $\delta > 0$,
$\sup_{t \in T}\Pr\{|Y_n(t)| > \delta n^{1/2}\} \to 0$ as $n \to \infty$, and

(iii) for all $\varepsilon > 0$, as $\delta \downarrow 0$,

$$\limsup_{n \to \infty}\Pr{}^*\Big\{n^{-1/2}\sup_{d(s,t) < \delta}\Big|\sum_{i=1}^n \big(X_{n,i}(t) - EX_{n,i}(t) - X_{n,i}(s) + EX_{n,i}(s)\big)\Big| > \varepsilon\Big\} \to 0.$$

Then for any $\gamma > 0$,

$$\lim_{n \to \infty} n\Pr{}^*\{\|X_{n,1}\|_T > \gamma n^{1/2}\} = 0. \tag{9.11}$$

If the hypotheses hold only along a subsequence $n_k$, then the conclusion also
holds along the subsequence.


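Proof. Taking more Cartesian products, let $\{V_{n,j}\}$ be an independent copy of
$\{X_{n,j}\}$, $n = 1, 2, \ldots$, $j = 1, \ldots, n$. For any $n = 1, 2, \ldots$,
$t \in T$, and $u > 0$ we have easily

$$\Pr\{|X_{n,1}(t) - V_{n,1}(t)| > u\} \le 2\Pr\{|X_{n,1}(t)| > u/2\}. \tag{9.12}$$

Then, condition (iii) implies that as $\delta \downarrow 0$,

$$\limsup_{n \to \infty}\Pr{}^*\Big\{n^{-1/2}\sup_{d(s,t) < \delta}\Big|\sum_{i=1}^n \big(X_{n,i}(t) - V_{n,i}(t) - X_{n,i}(s) + V_{n,i}(s)\big)\Big| > \varepsilon\Big\} \to 0. \tag{9.13}$$

Let $U_{n,j} := X_{n,j} - V_{n,j}$. For any $\tau > 0$, let
$\{s_1, \ldots, s_{N(\tau)}\}$ be a maximal subset of $T$ with
$d(s_i, s_j) \ge \tau/3$ for $i \ne j$. Let
$B_i := \{t \in T:\ d(t, s_i) < \tau/3\}$ and
$C_i := C_{i,\tau} := B_i \setminus \bigcup_{j < i}B_j$. Then $C_i$ for
$i = 1, \ldots, N(\tau)$ are disjoint sets of diameter $< \tau$ whose union covers
$T$ and such that $C_{i,\tau} \cap D \ne \emptyset$ for each $i$. Choose
$t_i \in C_{i,\tau} \cap D$ for each $i$ and let
$U_{n,j,\tau}(t) := U_{n,j}(t_i)$ for each $t \in C_{i,\tau}$. Then

$$\Pr{}^*\{\|U_{n,1}\|_T > 2\gamma\sqrt{n}\} \le \Pr{}^*\{\|U_{n,1,\tau}\|_T > \gamma\sqrt{n}\} + \Pr{}^*\{\|U_{n,1} - U_{n,1,\tau}\|_T > \gamma\sqrt{n}\}. \tag{9.14}$$

Then, as $n \to \infty$,

$$n\Pr{}^*\{\|U_{n,1,\tau}\|_T > \gamma\sqrt{n}\} \le \sum_{i=1}^{N(\tau)} n\Pr\{|U_{n,1}(t_i)| > \gamma\sqrt{n}\} \to 0 \tag{9.15}$$

by hypothesis (i) and (9.12). For an arbitrary set $A$ in a probability space, let
$A^*$ be a measurable cover, so that $A^*$ is a measurable set and
$(1_A)^* = 1_{A^*}$. Then it is easily checked that for any sets
$A_1, \ldots, A_k$, $(\bigcup_{j=1}^k A_j)^* = \bigcup_{j=1}^k A_j^*$ up
to sets of probability zero. Take
$A_i := \{\|U_{n,i} - U_{n,i,\tau}\|_T > \gamma n^{1/2}\}$. Then by
Theorem 3.3 and Lemma 3.6,

$$\begin{aligned}
1 - \exp(-n\Pr{}^*(A_1)) &\le 1 - (1 - \Pr{}^*(A_1))^n = \Pr\Big(\bigcup_{i=1}^n A_i^*\Big) \\
&= \Pr{}^*\Big\{\max_{1 \le i \le n}\|U_{n,i} - U_{n,i,\tau}\|_T > \gamma\sqrt{n}\Big\}.
\end{aligned}$$

Let $S_k := \sum_{i=1}^k (U_{n,i} - U_{n,i,\tau})$. Then the last expression in the
preceding display is

$$\le \Pr{}^*\Big(\max_{k \le n}\|S_k\|_T > \gamma\sqrt{n}/2\Big),$$

which, by Lévy's inequality with stars (3.12(b)), is

$$\begin{aligned}
&\le 2\Pr{}^*\Big\{\Big\|\sum_{i=1}^n (U_{n,i} - U_{n,i,\tau})\Big\|_T > \gamma\sqrt{n}/2\Big\} \\
&\le 2\Pr{}^*\Big\{n^{-1/2}\sup_{d(s,t) < \tau}\Big|\sum_{i=1}^n \big(X_{n,i}(t) - V_{n,i}(t) - X_{n,i}(s) + V_{n,i}(s)\big)\Big| > \gamma/2\Big\}.
\end{aligned}$$

By (9.13) the latter becomes less than any given $\zeta > 0$ for
$0 < \tau \le \tau_0(\zeta)$ small enough and $n \ge n_0(\zeta)$ large enough. Take
$\zeta < 1/2$. Then $1 - \zeta > e^{-2\zeta}$, and so $n\Pr^*(A_1) < 2\zeta$. Thus as
$\tau \downarrow 0$ and $n \to \infty$,

$$n\Pr{}^*\{\|U_{n,1} - U_{n,1,\tau}\|_T > \gamma\sqrt{n}\} \to 0.$$

Thus, applying (9.14) and (9.15) gives

$$\lim_{\tau \downarrow 0}\limsup_{n \to \infty} n\Pr{}^*\{\|U_{n,1} - U_{n,1,\tau}\|_T > \gamma\sqrt{n}\} = \lim_{n \to \infty} n\Pr{}^*\{\|U_{n,1}\|_T > \gamma\sqrt{n}\} = 0. \tag{9.16}$$

Next, by Lemma 3.6 and its proof, $\|X_{n,1}\|_T^*$ and $\|V_{n,1}\|_T^*$ are
independent. Applying also Lemma 3.8, and recalling that $X_{n,1}$ and $V_{n,1}$ are
copies of $Y_n$, we get

$$n\Pr\{\|U_{n,1}\|_T^* > \gamma\sqrt{n}\} \ge n\Pr\{\|Y_n\|_T^* > 2\gamma\sqrt{n}\}\Pr\{\|Y_n\|_T^* \le \gamma\sqrt{n}\}.$$

By Lemma 9.5(a),

$$n\Pr\{\|Y_n\|_T^* > \gamma n^{1/2}\} \le \frac{n\Pr{}^*\{\|U_{n,1}\|_T > \gamma n^{1/2}/2\}}{1 - \sup_{t \in T}\Pr\{|Y_n(t)| > \gamma n^{1/2}/2\}}.$$

Here the numerator $\to 0$ as $n \to \infty$ by (9.16) and
$\sup_{t \in T}\Pr\{|Y_n(t)| > \gamma n^{1/2}/2\} \to 0$ as $n \to \infty$ by
hypothesis (ii). So the conclusion (9.11) holds. The proof
for subsequences is the same.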
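Next will be a characterization of Donsker classes, in which the asymptotic
equicontinuity condition (Theorem 3.34) is put into symmetrized forms. For a
class $\mathcal{F}$ of functions let

$$\mathcal{F}'_\delta := \{f - g:\ f, g \in \mathcal{F},\ \rho_P(f, g) < \delta\}.$$

Let $\|\cdot\|_{\delta,\mathcal{F}} := \|\cdot\|_{\mathcal{G}}$ where
$\mathcal{G} := \mathcal{F}'_\delta$.


Theorem 9.14 Let $(X, \mathcal{A}, P)$ be a probability space and $\mathcal{F}$ a
class of functions with $\mathcal{F} \subset \mathcal{L}^2(X, \mathcal{A}, P)$.
Assume (9.2), so that $\varepsilon_i$ are i.i.d. Rademacher functions,
independent of $X_1, X_2, \ldots$. Suppose that for each $x \in X$,

$$F(x) := \sup_{f \in \mathcal{F}}|f(x) - Pf| < \infty. \tag{9.17}$$

Then the following are equivalent:

(a) $\mathcal{F}$ is a Donsker class for $P$;

(b) $\mathcal{F}$ is totally bounded for $\rho_P$ and for any $\varepsilon > 0$, as
$\delta \downarrow 0$,

$$\limsup_{n \to \infty}\Pr{}^*\Big\{n^{-1/2}\Big\|\sum_{i=1}^n \varepsilon_i(f(X_i) - Pf)\Big\|_{\delta,\mathcal{F}} > \varepsilon\Big\} \to 0;$$

(c) $(\mathcal{F}, \rho_P)$ is totally bounded and as $\delta \downarrow 0$,

$$\limsup_{n \to \infty} n^{-1/2}E^*\Big\|\sum_{i=1}^n \varepsilon_i(f(X_i) - Pf)\Big\|_{\delta,\mathcal{F}} \to 0;$$

(d) $(\mathcal{F}, \rho_P)$ is totally bounded and for
$\nu_n := n^{1/2}(P_n - P)$, as $\delta \downarrow 0$,

$$\limsup_{n \to \infty} E^*\|\nu_n\|_{\delta,\mathcal{F}} \to 0.$$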

To prove the theorem, we first show that (a) implies (b). Applying Lemma
9.6(a) to the class $\mathcal{F}'_\delta$, the outer probability in (b) is bounded
above by

$$2\max_{k \le n}\Pr{}^*\Big\{n^{-1/2}\Big\|\sum_{i=1}^k (f(X_i) - Pf)\Big\|_{\delta,\mathcal{F}} > \varepsilon/2\Big\}.$$

For any fixed $N$, we can write for $n > N$ that

$$\max_{k \le n} = \max\Big(\max_{k \le N},\ \max_{N < k \le n}\Big).$$

By the asymptotic equicontinuity condition (Theorem 3.34(b)), there are an
$N < \infty$ and a $\delta > 0$ such that for $N < k \le n$,

$$\Pr{}^*\Big\{n^{-1/2}\Big\|\sum_{i=1}^k (f(X_i) - Pf)\Big\|_{\delta,\mathcal{F}} > \varepsilon/4\Big\} < \varepsilon/4, \tag{9.18}$$

since $n^{-1/2} \le k^{-1/2}$. Then for $1 \le j \le N$, note that

$$\sum_{i=1}^j = \sum_{i=1}^{N+j} - \sum_{i=j+1}^{N+j}$$

and inequalities for $\sum_{i=1}^N$ also hold for $\sum_{i=j+1}^{N+j}$. Thus (9.18)
holds also, with $\varepsilon/4$ replaced by $\varepsilon/2$, for $1 \le k \le N$,
and (b) holds.

Now to prove (b) implies (c), Theorem 9.13 will be applied with
$D := T := \mathcal{F}$ and $X_{n,i}(f) := \varepsilon_i(f(X_i) - Pf)$. Then
$EX_{n,i}(f) = 0$ for all $f \in \mathcal{F}$, and (b) gives hypothesis (iii). For
hypothesis (i) (9.10), again letting $1\{\ldots\} := 1_{\{\ldots\}}$, we have

$$n\Pr\{|f(X_1) - Pf| > \beta n^{1/2}\} \le \beta^{-2}E\big((f(X_1) - Pf)^2\,1\{|f(X_1) - Pf| > \beta n^{1/2}\}\big) \to 0$$

as $n \to \infty$ by dominated convergence. Hypothesis (ii) holds since
$\mathcal{F}$ is totally bounded for $\rho_P$. So all three hypotheses of Theorem
9.13 and its conclusion (9.11) hold for the given $X_{n,i}$. In proving (c) the
following will be helpful.


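Lemma 9.15 Let $\xi_i$, $i = 1, 2, \ldots$, be i.i.d. nonnegative random variables
such that

$$M := \sup_{t > 0} t^2\Pr\{\xi_1 > t\} < \infty. \tag{9.19}$$

Then, for all $r$ such that $0 < r < 2$,

$$\sup_n n^{-r/2}E\max_{1 \le i \le n}\xi_i^r < \infty. \tag{9.20}$$

Proof. We have, as in the proof of Theorem 9.9, for each $n$,

$$\begin{aligned}
n^{-r/2}E\max_{1 \le i \le n}\xi_i^r &= rn^{-r/2}\int_0^\infty t^{r-1}\Pr\Big\{\max_{1 \le i \le n}\xi_i > t\Big\}\,dt \\
&\le 1 + rn^{1-r/2}\int_{\sqrt{n}}^\infty t^{r-1}\Pr\{\xi_1 > t\}\,dt \\
&\le 1 + Mrn^{1-r/2}\int_{\sqrt{n}}^\infty t^{r-3}\,dt = 1 + \frac{Mr}{2 - r},
\end{aligned}$$

proving (9.20).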


Now continuing the proof that (b) implies (c), for
$\xi_i := \|f(X_i) - Pf\|_{\mathcal{F}}$ condition (9.19) follows from equation
(9.11). This implies, e.g., for $r = 3/2$, that the sequence
$\{\max_{i \le n}\xi_i/\sqrt{n}\}_{n \ge 1}$ is uniformly integrable. But, again by
(9.11), we also have, letting $\gamma \to 0$, that
$\max_{i \le n}\xi_i/\sqrt{n} \to 0$ in probability. It follows (RAP, Theorem
10.3.6) that

$$E\max_{i \le n}\xi_i/\sqrt{n} \to 0. \tag{9.21}$$

Next, Theorem 9.9 will be applied, where the original sample space $S$ as in
(9.2) is replaced by $S \times \{-1, 1\}$, the random variables are
$(X_i, \varepsilon_i)$, and in place of $\mathcal{F}$, we take the class
$\mathcal{G}$ of functions $g := g_h$ where for each $h \in \mathcal{F}'_\delta$,
$g(x, s) := s(h(x) - Ph)$ for $s = \pm 1$, so that
$g(X_i, \varepsilon_i) = \varepsilon_i(h(X_i) - Ph)$. Also let $p := 1$. Then by
(9.21), the hypothesis of Theorem 9.9 holds, and in the conclusion, the first term on
the right is $o(n^{1/2})$ as $n \to \infty$ and $\delta \downarrow 0$. In
the definition of $u$, by the Lévy inequality (3.12), we have $u \le 2t_1$ if

$$\Pr{}^*\Big(\Big\|\sum_{i=1}^n g(X_i, \varepsilon_i)\Big\|_{\mathcal{G}} > t_1\Big) < \frac{1}{16},$$

and $t_1 = o(n^{1/2})$ also by (b). So (c) follows.
(c) implies (d) by desymmetrization, Lemma 9.10. 
(d) implies (a) by Markov’s inequality and the usual asymptotic equicon- 
tinuity condition, Theorem 3.34. 
The proof of Theorem 9.14 is complete. 


Next, Theorem 9.14 will be extended to multipliers other than Rademacher
variables. Suppose we have a probability space $(S, \mathcal{S}, P)$, a countable
product $\Omega'$ of copies of $(S, \mathcal{S}, P)$ with coordinates $X_j$, and a
probability space $\Omega''$, another product space with coordinates
$\varepsilon_j$ which are i.i.d. Rademacher variables. Let $Q$ be a Borel
probability measure on $[0, \infty)$ with $Q(\{0\}) < 1$ and let $\Omega'''$ be a
countable product of copies of $[0, \infty)$ with coordinates $v_j$, each with
Borel $\sigma$-algebra and probability $Q$.

For any real random variable $Y$ let

$$\Lambda_{2,1}(Y) := \int_0^\infty \big[\Pr(|Y| > t)\big]^{1/2}\,dt.$$

The following is known. One reference, Stein and Weiss (1971, Section V.3
on "$L(p, q)$" spaces), is related but considers function spaces on $[0, +\infty)$
with Lebesgue measure. I will give an elementary direct proof.


Lemma 9.16 For a real random variable $Y$,
(a) $\Lambda_{2,1}(Y) < \infty$ implies $E(Y^2) < \infty$.
(b) For any $\delta > 0$, $E|Y|^{2+\delta} < \infty$ implies $\Lambda_{2,1}(Y) < \infty$.


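Proof. Let $F$ be the distribution function of $|Y|$. (a): given
$\int_0^\infty \sqrt{1 - F(t)}\,dt < +\infty$, we want to show
$\int_0^\infty t^2\,dF(t) < +\infty$, or equivalently
$\int_0^\infty t^2\,d(1 - F)(t) > -\infty$. Integrating by parts, we get
$t^2(1 - F(t))\big|_0^\infty - 2\int_0^\infty t(1 - F(t))\,dt > -\infty$. The
boundary term at 0 is clearly 0. Let $g(t) := \sqrt{1 - F(t)}$ for $t \ge 0$. We
need to show that for a nonincreasing function $g \ge 0$ with
$\int_0^\infty g(t)\,dt < +\infty$ that $tg(t) \to 0$ as $t \to +\infty$, as is
surely well known, but suppose not. Choose $t_k \to +\infty$ with
$t_kg(t_k) \ge \varepsilon > 0$ for all $k$. We can and do
assume $t_k \ge 2t_{k-1}$ for all $k$. Then for each $k \ge 2$,

$$\int_{t_{k-1}}^{t_k} g(t)\,dt \ge (t_k - t_{k-1})g(t_k) \ge \frac{t_k - t_{k-1}}{t_k}\,\varepsilon \ge \frac{\varepsilon}{2}.$$

Summing over $k$ gives a contradiction. So the upper boundary term is also 0
and we need to show $\int_0^\infty tg^2(t)\,dt < +\infty$. This integral is bounded
above by

$$\begin{aligned}
\sum_{n=0}^\infty (n + 1)g(n)^2 &= \sum_{n=0}^\infty\sum_{j=0}^n g(n)^2 = \sum_{j=0}^\infty\sum_{n=j}^\infty g(n)^2 \\
&\le \sum_{j=0}^\infty g(j)\sum_{n=j}^\infty g(n) \le \Big[1 + \int_0^\infty g(t)\,dt\Big]^2 < \infty,
\end{aligned}$$

which proves (a).

(b): Suppose $E|Y|^{2+\delta} < \infty$. Then
$+\infty > g(y) := \int_y^\infty t^{2+\delta}\,dF(t) \to 0$ as $y \to \infty$ and
$g(y) \ge y^{2+\delta}(1 - F(y))$ for $y \ge 0$. Thus
$g(y) = -\int_y^\infty t^{2+\delta}\,d(1 - F(t))$ can be integrated by parts and
equals $y^{2+\delta}(1 - F(y)) + \int_y^\infty (1 - F(t))(2 + \delta)t^{1+\delta}\,dt$.
By the Cauchy-Bunyakovsky-Schwarz inequality this implies

$$\begin{aligned}
\int_1^\infty \sqrt{1 - F(t)}\,dt &= \int_1^\infty \sqrt{1 - F(t)}\;t^{(1+\delta)/2}\,t^{-(1+\delta)/2}\,dt \\
&\le \Big(\int_1^\infty (1 - F(t))t^{1+\delta}\,dt\Big)^{1/2}\Big(\int_1^\infty t^{-1-\delta}\,dt\Big)^{1/2} < +\infty,
\end{aligned}$$

proving (b) and the Lemma.


Remark. Part (a) shows that part (b) does not hold for any $\delta < 0$ and will be
used later in this chapter.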
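
When $Y$ takes only finitely many values, $t \mapsto \Pr(|Y| > t)$ is a step function and $\Lambda_{2,1}(Y)$ reduces to a finite sum. The sketch below computes it in that special case; the function names and the example distributions are hypothetical, chosen only to make the definition concrete.

```python
import math

def lambda_21(values, probs):
    """Lambda_{2,1}(Y) = integral over t >= 0 of Pr(|Y| > t)^{1/2},
    for Y taking values[i] with probability probs[i]."""
    levels = sorted(set(abs(v) for v in values))
    total, previous = 0.0, 0.0
    for level in levels:
        # for previous <= t < level, Pr(|Y| > t) = Pr(|Y| >= level)
        tail = sum(p for v, p in zip(values, probs) if abs(v) >= level)
        total += (level - previous) * math.sqrt(tail)
        previous = level
    return total

def second_moment(values, probs):
    """E(Y^2) for the same finitely supported Y."""
    return sum(p * v * v for v, p in zip(values, probs))

rademacher = lambda_21([-1, 1], [0.5, 0.5])
two_point = lambda_21([0, 2], [0.75, 0.25])
```

For a Rademacher variable the integrand is 1 on $[0, 1)$ and 0 afterwards, so the integral is 1.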


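Lemma 9.17 Let $\Omega = \Omega' \times \Omega'' \times \Omega'''$ be the product of
the three probability spaces defined in the last paragraph, with coordinates $X_j$,
$\varepsilon_j$, and $v_j$. Let
$\mathcal{F} \subset \mathcal{L}^2(S, \mathcal{S}, P)$ and
$F_{\mathcal{F}}(x) := \sup_{f \in \mathcal{F}}|f(x)|$. Assume that
$E^*F_{\mathcal{F}} < +\infty$.

(a) Let $\xi_j := \varepsilon_jv_j$ for each $j$, so that $v_j = |\xi_j|$ and
$\xi_j$ are i.i.d. real symmetric random variables. Then for integers
$0 \le m < n < \infty$ we have

$$\begin{aligned}
n^{-1/2}(Ev_1)E^*\Big\|\sum_{i=1}^n \varepsilon_if(X_i)\Big\|_{\mathcal{F}}
&\le n^{-1/2}E^*\Big\|\sum_{i=1}^n \xi_if(X_i)\Big\|_{\mathcal{F}} \\
&\le mn^{-1/2}\big(E^*\|f(X_1)\|_{\mathcal{F}}\big)E\Big(\max_{i \le n}v_i\Big) \\
&\quad + \Lambda_{2,1}(v_1)\max_{m < k \le n}k^{-1/2}E^*\Big\|\sum_{i=1}^k \varepsilon_if(X_i)\Big\|_{\mathcal{F}}.
\end{aligned} \tag{9.22}$$

(b) If $\Omega'''$ is replaced by a countable product of copies of $\mathbb{R}$ and
$Q$ by a Borel probability measure with $\int_{-\infty}^\infty x\,dQ(x) = 0$, so that
the coordinates $v_i$ are centered ($Ev_i = 0$), then we have for $0 \le m < n$

$$\begin{aligned}
n^{-1/2}(E|v_1 - v_2|/2)E^*\Big\|\sum_{i=1}^n \varepsilon_if(X_i)\Big\|_{\mathcal{F}}
&\le n^{-1/2}E^*\Big\|\sum_{i=1}^n v_if(X_i)\Big\|_{\mathcal{F}} \\
&\le 2mn^{-1/2}\big(E^*\|f(X_1)\|_{\mathcal{F}}\big)E\Big(\max_{i \le n}|v_i|\Big) \\
&\quad + 3\Lambda_{2,1}(v_1)\max_{m < k \le n}k^{-1/2}E^*\Big\|\sum_{i=1}^k \varepsilon_if(X_i)\Big\|_{\mathcal{F}}.
\end{aligned} \tag{9.23}$$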
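Proof. Part (a): writing $\|\cdot\| := \|\cdot\|_{\mathcal{F}}$, we have by Theorem
3.9 (the one-sided Tonelli-Fubini theorem)

$$E^*\Big\|\sum_{i=1}^n \xi_if(X_i)\Big\| \ge E_{X,\varepsilon}^*E_v\Big\|\sum_{i=1}^n \varepsilon_iv_if(X_i)\Big\| \ge E^*\Big\|\sum_{i=1}^n \varepsilon_i(Ev_i)f(X_i)\Big\|$$

by Jensen's inequality, Lemma 9.7(a), applied to $E_v$. The first inequality in
(9.22) follows. For the second we have that

$$\begin{aligned}
E(n) := E^*\Big\|\sum_{i=1}^n \xi_if(X_i)\Big\| &= E^*\Big\|\sum_{i=1}^n \varepsilon_iv_if(X_i)\Big\| \\
&= E^*\Big\|\sum_{i=1}^n \Big(\int_0^\infty 1\{v_i \ge t\}\,dt\Big)\varepsilon_if(X_i)\Big\| \\
&= E^*\Big\|\int_0^\infty \Big(\sum_{i=1}^n 1\{v_i \ge t\}\varepsilon_if(X_i)\Big)\,dt\Big\|.
\end{aligned}$$

Let $F_t(X, \varepsilon, v) := n^{-1/2}\|\sum_{i=1}^n 1\{v_i \ge t\}\varepsilon_if(X_i)\|$
and $v_{\max} := \max_{j \le n}v_j$. For any $t$ and
$\omega = (\omega', \omega'', \omega''')$ we have

$$F_t(X, \varepsilon, v) \le G_t(X, v) := n^{-1/2}\,1\{t \le 1 + v_{\max}\}\sum_{i=1}^n F_{\mathcal{F}}(X_i),$$

which is finite almost surely. For almost all $\omega$, $F_t(X, \varepsilon, v)$ is
a finite-valued step function of $t$, equaling 0 for $t > v_{\max}$. Thus
$\int_0^\infty F_t(X, \varepsilon, v)\,dt$ is defined and finite. Also,
$\int_0^{1/k}F_t(X, \varepsilon, v)\,dt \to 0$ as $k \to \infty$ almost surely, by
dominated convergence. For each $j, k = 1, 2, \ldots$ and $t \ge 1/k$ let
$H_{k,t}(\omega) := H_{k,t}(X, \varepsilon, v) := F_{j/k}(X, \varepsilon, v)$ for
$j/k \le t < (j + 1)/k$. Then
$H_{k,t}(X, \varepsilon, v) \to F_t(X, \varepsilon, v)$ for all $t > 0$ as
$k \to \infty$ since $t \mapsto F_t$ is left-continuous. We have
$H_{k,t}(\omega) \le G_t(X, v)$ since if $H_{k,t}(\omega) \ne 0$, then
$H_{k,t}(\omega) = F_{j/k}(X, \varepsilon, v)$, where $j/k \le v_{\max}$ and
$j/k \le t < (j + 1)/k$ imply $t < 1 + v_{\max}$. We have

$$\int_0^\infty G_t(X, v)\,dt = n^{-1/2}(1 + v_{\max})\sum_{i=1}^n F_{\mathcal{F}}(X_i) < +\infty$$

for almost all $\omega$. Therefore, by dominated convergence,

$$\int_0^{+\infty}F_t(X, \varepsilon, v)\,dt = \lim_{k \to \infty}\int_{1/k}^\infty H_{k,t}(X, \varepsilon, v)\,dt = \lim_{k \to \infty}\frac{1}{k}\sum_{j=1}^\infty F_{j/k}(X, \varepsilon, v)$$

almost surely. Then

$$\begin{aligned}
n^{-1/2}E^*\Big\|\int_0^\infty \Big(\sum_{i=1}^n 1\{v_i \ge t\}\varepsilon_if(X_i)\Big)\,dt\Big\|
&\le E^*\int_0^\infty F_t(X, \varepsilon, v)\,dt = E^*\lim_{k \to \infty}\sum_{j=1}^\infty F_{j/k}(X, \varepsilon, v)/k \\
&\le E^*\liminf_{k \to \infty}\sum_{j=1}^\infty F_{j/k}^*(X, \varepsilon, v)/k \\
&\le \liminf_{k \to \infty}\sum_{j=1}^\infty E^*F_{j/k}(X, \varepsilon, v)/k
\end{aligned}$$

by Fatou's lemma with stars, Theorem 3.11.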
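For each set $G \subset \{1, 2, \ldots, n\}$ and $t \ge 0$ let

$$H(G, t) := \{v := \{v_i\}_{i=1}^n \in [0, +\infty)^n:\ t \le v_i \text{ if and only if } i \in G\}.$$

It is easily seen that for any measurable set $A$ and functions $f, g \ge 0$ with
$f = g$ on $A$, $(f1_A)^* = 1_Af^* = 1_Ag^*$ a.s. Given $t$, the sets $H(G, t)$ are
disjoint and Borel measurable with union $[0, \infty)^n$. For $v \in H(G, t)$, we
have $F_t(X, \varepsilon, v) = n^{-1/2}\|\sum_{i \in G}\varepsilon_if(X_i)\|$. So by
Lemma 3.6(c),

$$F_t^*(X, \varepsilon, v) = n^{-1/2}\sum_G 1_{H(G,t)}(v)\Big\|\sum_{i \in G}\varepsilon_if(X_i)\Big\|^*.$$

Each term of the finite sum is a function of $v$ times a function of
$(X, \varepsilon)$. As we have a Cartesian product probability space, it follows by
Theorem 3.3 and then the ordinary Tonelli-Fubini theorem that

$$E^*F_t = E(F_t^*) = n^{-1/2}\sum_G \Pr(v \in H(G, t))\,E^*\Big\|\sum_{i \in G}\varepsilon_if(X_i)\Big\|.$$

The $\sum_G$ can be written as $\sum_{k=0}^n\sum_{G:\,|G| = k}$. For
$|G| = k = 0$, the $\sum_{i \in G}$ and the envelope of its norm are 0. For each
$k = 1, \ldots, n$,

$$\sum_{G:\,|G| = k}\Pr\{v \in H(G, t)\} = \Pr\Big\{\sum_{i=1}^n 1\{v_i \ge t\} = k\Big\}.$$

For any $G$ with $|G| = k$,

$$E^*\Big\|\sum_{i \in G}\varepsilon_if(X_i)\Big\| = E^*\Big\|\sum_{i=1}^k \varepsilon_if(X_i)\Big\|$$

since probabilities for $\{(\varepsilon_i, X_i)\}_{i=1}^n$ are preserved by
permutations of the indices $i$. It follows that

$$E^*F_t(X, \varepsilon, v) = n^{-1/2}\sum_{k=1}^n \Pr\Big(\sum_{i=1}^n 1\{v_i \ge t\} = k\Big)\cdot E^*\Big\|\sum_{i=1}^k \varepsilon_if(X_i)\Big\|.$$

Thus $E(n)$ is bounded above by

$$\begin{aligned}
&\int_0^\infty \Big(\sum_{k=1}^n \Pr\Big\{\sum_{i=1}^n 1\{v_i \ge t\} = k\Big\}E^*\Big\|\sum_{i=1}^k \varepsilon_if(X_i)\Big\|\Big)\,dt \\
&\quad\le \Big(\int_0^\infty \Pr\Big\{\sum_{i=1}^n 1\{v_i \ge t\} > 0\Big\}\,dt\Big)\max_{k \le m}E^*\Big\|\sum_{i=1}^k \varepsilon_if(X_i)\Big\| \\
&\qquad + \Big(\int_0^\infty \sum_{k=m+1}^n k^{1/2}\Pr\Big\{\sum_{i=1}^n 1\{v_i \ge t\} = k\Big\}\,dt\Big) \times \max_{m < k \le n}E^*\Big\|k^{-1/2}\sum_{i=1}^k \varepsilon_if(X_i)\Big\|.
\end{aligned}$$

Now,

$$\sum_{k > m}k^{1/2}\Pr\Big\{\sum_{i=1}^n 1\{v_i \ge t\} = k\Big\} \le E\Big(\Big(\sum_{i=1}^n 1\{v_i \ge t\}\Big)^{1/2}\Big) \le \Big(E\sum_{i=1}^n 1\{v_i \ge t\}\Big)^{1/2} = n^{1/2}\Pr(v_1 \ge t)^{1/2}.$$

Thus by subadditivity of $E^*$ of norms (Lemma 3.5)

$$\begin{aligned}
E(n) &\le m\int_0^\infty \Pr\{v_{\max} \ge t\}\,dt\;E^*\|f(X_1)\| + n^{1/2}\Lambda_{2,1}(v_1)\max_{m < k \le n}E^*\Big\|k^{-1/2}\sum_{i=1}^k \varepsilon_if(X_i)\Big\| \\
&= mE^*\|f(X_1)\|E(v_{\max}) + n^{1/2}\Lambda_{2,1}(v_1)\max_{m < k \le n}E^*\Big\|k^{-1/2}\sum_{i=1}^k \varepsilon_if(X_i)\Big\|,
\end{aligned}$$

finishing the proof for part (a).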
For part (b), let $\zeta_j := v_j - \eta_j$, $j = 1, \ldots, n$. We have

$$E^*\Big\|\sum_{i=1}^n v_if(X_i)\Big\| \le E^*\Big\|\sum_{i=1}^n \zeta_if(X_i)\Big\|$$

by Jensen's inequality (9.3) applied to functions $g(X, v) := vf(X)$. Part (a)
applies for $\zeta_i$ in place of $\xi_i$. Clearly,
$E(\max_{j \le n}|\zeta_j|) \le 2E(\max_{i \le n}|v_i|)$. We also have

$$\Lambda_{2,1}(\zeta_1) = \int_0^\infty \Pr(|\zeta_1| > t)^{1/2}\,dt \le \int_0^\infty \big(2\Pr(|v_1| > t/2)\big)^{1/2}\,dt \le 3\Lambda_{2,1}(v_1).$$

For the lower bound, it is easily seen that

$$E^*\Big\|\sum_{i=1}^n \zeta_if(X_i)\Big\| \le 2E^*\Big\|\sum_{i=1}^n v_if(X_i)\Big\|,$$

and the conclusion follows.


Next, here is a characterization of Donsker classes in terms of
multipliers $\xi_i$.


Theorem 9.18 Let $\mathcal{F}$ be a class such that hypotheses (9.2) hold and
$\{f - Pf:\ f \in \mathcal{F}\}$ has a finite envelope function (9.17). Let $\xi_i$
be i.i.d. centered real random variables, independent of $X_1, X_2, \ldots$,
specifically, defined on a different factor of a product probability space, such
that $E|\xi_1| > 0$ and $\Lambda_{2,1}(\xi_1) < \infty$. Then $\mathcal{F}$ is
Donsker for $P$ if and only if both $(\mathcal{F}, \rho_P)$ is totally bounded and

$$\lim_{\delta \downarrow 0}\limsup_{n \to \infty}E^*\Big\|n^{-1/2}\sum_{i=1}^n \xi_i(f(X_i) - Pf)\Big\|_{\delta,\mathcal{F}} = 0. \tag{9.24}$$


Proof. We can assume that each $f \in \mathcal{F}$ is centered, replacing $f$ by
$f - Pf$. Recall that (a) implies (c) in Theorem 9.14. Also, we have
$\sup_{t > 0}t(\Pr(|\xi_1| > t))^{1/2} \le \Lambda_{2,1}(\xi_1) < \infty$, so (9.19)
holds for $|\xi_1|$. Thus by (9.22), it follows that (9.24) holds. Conversely,
assume that $\mathcal{F}$ is $\rho_P$-totally bounded and (9.24) holds.
Then by the first displayed inequality in Lemma 9.17, applied to
$\mathcal{F}'_\delta$ for each $\delta > 0$, and since $E|\xi_1| > 0$, it follows
that (c) in Theorem 9.14 holds, so by that theorem, $\mathcal{F}$ is Donsker for $P$.


Now, the proof of the bootstrap central limit theorem in probability, Theorem 9.3, will be finished. Let $\mathcal F$ be $P$-Donsker. We can, again, assume that $Pf = 0$ for all $f \in \mathcal F$ since each $f \in \mathcal F$ can be replaced by $f - Pf$ without changing $\nu_n(f)$ or $\nu_n^B(f)$. Recall that the conclusion is equivalent to a statement in terms of the metric $\beta_{\mathcal F}$. Since $\mathcal F$ is totally bounded for $\rho_P$ by Theorem 3.34, for any $\tau > 0$ there is an $N(\tau) < \infty$ and a map $\pi_\tau$ from $\mathcal F$ into a subset having $N(\tau)$ elements such that $\rho_P(\pi_\tau f, f) < \tau$ for all $f \in \mathcal F$. Let $\nu^B_{n,\tau}(f) := \nu_n^B(\pi_\tau f)$ and $G_{P,\tau}(f) := G_P(\pi_\tau(f))$. Then for any bounded Lipschitz function $H$ on $\ell^\infty(\mathcal F)$ for $\|\cdot\|_{\mathcal F}$,

$$\begin{aligned}
|E^B H(\nu_n^B) - EH(G_P)| &\le |E^B H(\nu_n^B) - E^B H(\nu^B_{n,\tau})| \\
&\quad + |E^B H(\nu^B_{n,\tau}) - EH(G_{P,\tau})| + |EH(G_{P,\tau}) - EH(G_P)| \\
&=: A_{n,\tau}(H) + B_{n,\tau}(H) + C_\tau(H).
\end{aligned}$$

Let $BL(1)$ denote the set of all functions $H$ on $\ell^\infty(\mathcal F)$ with bounded Lipschitz norm $\le 1$. Then $\sup_{H\in BL(1)} C_\tau(H) \to 0$ as $\tau \downarrow 0$ since by definition of Donsker class, $G_P$ is a.s. uniformly continuous with respect to $\rho_P$, so $G_{P,\tau} \to G_P$ uniformly on $\mathcal F$ as $\tau \downarrow 0$. For fixed $\tau > 0$, the supremum for $H \in BL(1)$ of $B_{n,\tau}(H)$ is measurable and converges to 0 a.s. by Theorem 9.4, the finite-dimensional bootstrap central limit theorem, in $\mathbb R^{N(\tau)}$. Recall the definition of $\|\cdot\|_{\delta,\mathcal F}$ from before Theorem 9.14. For the $A_{n,\tau}$ term, we have $\sup_{H\in BL(1)} A_{n,\tau}(H) \le 2E^B\|\nu_n^B\|_{\tau,\mathcal F}$, so it will be enough to show that for all $\varepsilon > 0$, as $\delta \downarrow 0$,

$$\limsup_{n\to\infty}\ {\Pr}^*\bigl\{E^B\|\nu_n^B\|_{\delta,\mathcal F} > \varepsilon\bigr\} \to 0. \tag{9.25}$$
n> 


Apply (9.8) with B := €°(F;) and v; := 45y,() for each w, and the starred 
Tonelli—Fubini theorem where one coordinate is discrete (3.4). We get 


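$$\begin{aligned}
E^*E^B\|\nu_n^B\|_{\delta,\mathcal F} &= n^{-1/2}\,E^*E^B\Bigl\|\sum_{i=1}^n (\delta_{X_i^B} - P_n)\Bigr\|_{\delta,\mathcal F} \\
&\le \frac{e}{e-1}\, n^{-1/2}\, E^*\Bigl\|\sum_{i=1}^n (N_i - 1)(\delta_{X_i} - P_n)\Bigr\|_{\delta,\mathcal F} \\
&\le \frac{e}{e-1}\, n^{-1/2}\, E^*\Bigl\|\sum_{i=1}^n (N_i - 1)\,\delta_{X_i}\Bigr\|_{\delta,\mathcal F}
 + \frac{e}{e-1}\, n^{-1/2}\, E^*\Bigl(\Bigl|\sum_{i=1}^n (N_i - 1)\Bigr|\,\|P_n\|_{\delta,\mathcal F}\Bigr) =: S + T.
\end{aligned}$$

Since $\mathcal F$ is Donsker for $P$, the limit as $\delta \downarrow 0$ of $\limsup_n S$ is 0 by Theorem 9.18. For $T$, recalling (3.4), we have $E\bigl|\sum_{i=1}^n (N_i - 1)\bigr| \le n^{1/2}$, so $T \le (e/(e-1))\,E^*\|P_n\|_{\delta,\mathcal F}$. Recalling that each $f \in \mathcal F$ is taken to be centered, we have $E^*\|P_n\|_{\delta,\mathcal F} = E^*\|n^{-1/2}\nu_n + P\|_{\delta,\mathcal F} \to 0$ as $n \to \infty$ and $\delta \downarrow 0$ by Theorem 9.14(d). Thus Theorem 9.3 is proved.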
9.3 Other Aspects of the Bootstrap


B. Efron (1979) invented the bootstrap, and by now there is a very large 
literature about it. This section will address some aspects of the application of 
the Giné—Zinn theorems. These do not cover the entire field by any means. For 
example, some statistics of interest, such as $\max(X_1,\ldots,X_n)$, are not averages
$\frac{1}{n}(f(X_1) + \cdots + f(X_n))$ as $f$ ranges over a class $\mathcal F$.

Some bootstrap limit theorems are stated in probability, and others for 
almost sure convergence. To compare their usefulness, first note that almost 
sure convergence is not always preferable to convergence in probability: 


Example. Let $X_n$ be a sequence of real-valued random variables converging to some $X_0$ in probability but not almost surely. Then some subsequences $X_{n_k}$ converge to $X_0$ almost surely. Suppose this occurs whenever $n_k \ge k^2$ for all $k$. Let $Y_n := X_{k^2}$ for $2^k \le n < 2^{k+1}$ where $k = 0, 1, \ldots$. Then $Y_n \to X_0$ almost surely, but in a sense, $X_n \to X_0$ faster although it only converges in probability.


Another point is that almost sure convergence is applicable in statistics when 
inferences will be made from data sets with increasing values of n, in other 
words, in the part of statistics called sequential analysis. But suppose one has 
a fixed value of the sample size n, as has generally been the case with the 
bootstrap. Then the probability of an error of a given size, for a given n, which 
relates to convergence in probability, may be more relevant than the question 
of what would happen for values of n —> oo, as in almost sure convergence. 

The rest of this section will be devoted to confidence sets. A basic example of a confidence set is a confidence interval. As an example, suppose $X_1,\ldots,X_n$ are i.i.d. with distribution $N(\mu,\sigma^2)$ where $\sigma^2$ is known but $\mu$ is not. Then $\overline X := (X_1 + \cdots + X_n)/n$ has a distribution $N(\mu, \sigma^2/n)$. Thus

$$\Pr(\overline X \le \mu - 1.96\,\sigma/n^{1/2}) = .025 = \Pr(\overline X \ge \mu + 1.96\,\sigma/n^{1/2}).$$

So we have 95% confidence that the unknown $\mu$ belongs to the interval $[\overline X - 1.96\,\sigma/n^{1/2},\ \overline X + 1.96\,\sigma/n^{1/2}]$, which is then called a 95% confidence interval for $\mu$.
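
The interval above is simple enough to compute directly. The following is a minimal sketch, not part of the original text: the function name `normal_mean_ci` and the sample values are hypothetical, and the known standard deviation `sigma` is supplied by the caller.

```python
import math

def normal_mean_ci(sample, sigma, z=1.96):
    """Two-sided interval [xbar - z*sigma/sqrt(n), xbar + z*sigma/sqrt(n)]
    for the mean of a normal sample whose standard deviation sigma is known."""
    n = len(sample)
    xbar = sum(sample) / n
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical data: four observations with known sigma = 0.5.
lo, hi = normal_mean_ci([4.9, 5.3, 5.1, 4.7], sigma=0.5)
```

With these four observations the sample mean is 5.0 and the half-width is $1.96 \cdot 0.5/2 = 0.49$.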

Next, suppose $X_1,\ldots,X_n$ are i.i.d. in $\mathbb R^k$ with a normal (Gaussian) distribution $N(\mu, \sigma^2 I)$ where $I$ is the identity matrix. Suppose $\alpha > 0$ and $M_\alpha = M_\alpha(k)$ is such that $N(0, I)\{x:\ |x| > M_\alpha\} = \alpha$. Then $n^{1/2}(\overline X - \mu)/\sigma$ has distribution $N(0, I)$ so $\Pr(|\overline X - \mu| > M_\alpha\sigma/n^{1/2}) = \alpha$. Thus, the ball with center $\overline X$ and radius $M_\alpha\sigma/n^{1/2}$ is called a $100(1-\alpha)\%$ confidence set for the unknown $\mu$.
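
To make the radius $M_\alpha(k)$ concrete, here is a small sketch under stated assumptions. For $k = 2$ the squared length of a standard normal vector is exponentially distributed with mean 2, so $M_\alpha(2) = (-2\log\alpha)^{1/2}$ in closed form; the Monte Carlo routine is only a sanity check. The names `radius_k2` and `mc_tail` are hypothetical helpers, not notation from the text.

```python
import math
import random

def radius_k2(alpha):
    """M_alpha(2): for k = 2, Pr(|Z| > M) = exp(-M^2 / 2), so M = sqrt(-2 log alpha)."""
    return math.sqrt(-2.0 * math.log(alpha))

def mc_tail(k, radius, trials=50000, seed=0):
    """Monte Carlo estimate of N(0, I){x in R^k : |x| > radius}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        squared_length = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        if squared_length > radius * radius:
            hits += 1
    return hits / trials

M = radius_k2(0.05)        # radius of the 95% ball in the plane, in units of sigma / sqrt(n)
estimate = mc_tail(2, M)   # should be close to alpha = 0.05
```

For other values of $k$ no such elementary formula is available in general, and one would use chi-squared quantiles instead.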

When the distribution of the $X_i$ is not necessarily normal, but has finite variance, then the distribution of $\overline X$ will be approximately normal by the central limit theorem for $n$ large, so we get some approximate confidence sets.
Now to extend these ideas to the bootstrap, let $X_1,\ldots,X_n$ be i.i.d. from an otherwise unknown distribution $P$. Let $P_n$ be the empirical measure formed from $X_1,\ldots,X_n$. Let $\mathcal F$ be a universal Donsker class (Section 10.2). Then we know from the Giné–Zinn theorem in the last section that $\nu_n^B$ and $\nu_n$ have asymptotically the same distribution on $\mathcal F$. By repeated resampling, given a small $\alpha > 0$ such as $\alpha = .05$, one can find $M = M(\alpha)$ such that approximately $\Pr(\|\nu_n^B\|_{\mathcal F} > M \mid P_n) = \alpha$. Then

$$\{Q:\ \|Q - P_n\|_{\mathcal F} \le M/n^{1/2}\}$$

is an approximate $100(1-\alpha)\%$ confidence set for $P$.
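
The resampling recipe can be sketched for one specific class, chosen here only for illustration: $\mathcal F$ = indicators of half-lines $(-\infty, t]$ on $\mathbb R$, so that $\|\cdot\|_{\mathcal F}$ is the supremum distance between distribution functions. The sketch assumes the $n$ observations are distinct, in which case $\|\nu_n^B\|_{\mathcal F}$ depends only on how often each order statistic is resampled. The function names are hypothetical.

```python
import math
import random

def bootstrap_sup_norm(n, rng):
    """sqrt(n) * sup_t |F_n^B(t) - F_n(t)| for one bootstrap resample of n distinct points.

    counts[i] is the number of times the (i+1)-st order statistic is drawn.
    """
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    cumulative, worst = 0, 0
    for i, c in enumerate(counts, start=1):
        cumulative += c
        # n * |F_n^B - F_n| evaluated at the i-th order statistic
        worst = max(worst, abs(cumulative - i))
    return worst / math.sqrt(n)   # equals sqrt(n) * (worst / n)

def bootstrap_critical_value(n, alpha=0.05, resamples=2000, seed=0):
    """Empirical (1 - alpha) quantile of the resampled sup norms."""
    rng = random.Random(seed)
    norms = sorted(bootstrap_sup_norm(n, rng) for _ in range(resamples))
    return norms[math.ceil((1 - alpha) * resamples) - 1]

M = bootstrap_critical_value(200)
```

The resulting confidence set consists of all laws $Q$ whose distribution function stays within $M/n^{1/2}$ of the empirical distribution function, that is, a band around $F_n$.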


Problems 


1. If the sample space is the unit circle S! := {(cos@,sin@): 0 <0 < 
27}, and distribution functions are defined by evaluating laws on arcs 
{(cos@,sin@): 0 <68 <x}, show that the quantity in Corollary 9.2(c), 
sup, Ainn(x) — inf, Ainn(y)}, is invariant under rotations of the circle. 


2. (Continuation) For $m = n = 20$, find the approximation given by Corollary 9.2(c) to the probability that there exists an arc $A := \{(\cos\theta, \sin\theta):\ a < \theta < b\}$ such that $X_j \in A$ and $Y_j \notin A$ for $j = 1,\ldots,20$, if all 40 variables are i.i.d. from the same continuous distribution on the circle. Hint: Not many terms of the series should be needed.


3. In Corollary 9.2, 
(i) Show that the series in (b) and (c) diverge when u = 0. 


(ii) Show that for u = 0, the probabilities in (b) and (c) equal 1 for any m > 1 
andn > 1 and for y,. 


(iii) Show that the three expressions on the right are all less than 1 for $u > 0$ and converge to 1 as $u \downarrow 0$, so that the corresponding distribution functions are continuous everywhere. Hint: For (b) and (c) use the expressions in terms of Brownian bridge limits.


4. Let $X_1, X_2, \ldots$ be i.i.d. in $\mathbb R$ with a continuous distribution. Given $n$, arrange $X_1,\ldots,X_n$ in order as $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$. Let $X_j^B$, $j = 1,\ldots,n$, be i.i.d. $(P_n)$, a bootstrap sample. Thus $\Pr(X_j^B = X_{(i)}) = 1/n$ for $i, j = 1,\ldots,n$. Let $X^B_{(1)} := \min_{1\le j\le n} X_j^B$.
(a) Find $p_{ni} := \Pr(X^B_{(1)} = X_{(i)})$ for $i = 1,\ldots,n$.

(b) What does $p_{ni}$ converge to as $n \to \infty$ for each $i = 1, 2, \ldots$?


In the next two problems, let $(X, \mathcal A, P)$ be a probability space and suppose that $\mathcal F \subset \mathcal L^2(X, \mathcal A, P)$ is a Donsker class for $P$. Let $X_1, X_2, \ldots$ be i.i.d. $(P)$. Convergence in law, conditional on $P_n$ or $P_{2n}$, in outer probability as $n \to \infty$, is defined just as in the definition of bootstrap Donsker class early in Section 9.2, except for some interchanges of $n$ and $2n$.


5. Given $P_n$ formed from $X_1,\ldots,X_n$, let $2n$ points $X(j,\beta) := X_j^\beta$, $j = 1,\ldots,2n$, be i.i.d. $(P_n)$, and $P_{2n}^\beta := \frac{1}{2n}\sum_{j=1}^{2n}\delta_{X(j,\beta)}$. (The $\beta$ superscript is analogous to the $B$ superscript for the bootstrap, but here the bootstrap sample size is $2n$.) Show that $(2n)^{1/2}(P_{2n}^\beta - P_n)$, conditional on $P_n$, converges in law to $G_P$, in outer probability.


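6. Given $P_{2n}$ formed from $X_1,\ldots,X_{2n}$, let $n$ points $X(j,\gamma) := X_j^\gamma$, $j = 1,\ldots,n$, be i.i.d. $(P_{2n})$, and $P_n^\gamma := \frac{1}{n}\sum_{j=1}^{n}\delta_{X(j,\gamma)}$. Show that $n^{1/2}(P_n^\gamma - P_{2n})$, conditional on $P_{2n}$, converges in law to $G_P$, in outer probability.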


7. Prove the statements before Lemma 9.17, that $\Lambda_{2,1}(Y) < \infty$ implies $E(Y^2) < \infty$, and for any $\delta > 0$, $E|Y|^{2+\delta} < \infty$ implies $\Lambda_{2,1}(Y) < \infty$.


Notes 


Notes to Section 9.1. I do not have a reference for Theorem 9.1. A form 
of the theorem for classes of sets was given in Dudley (1978). Corollary 
9.2 is classical and gives asymptotic distributions of so-called Kolmogorov— 
Smirnov (parts (a),(b)) and Kuiper (part (c)) statistics. See the notes to RAP, 
Section 12.3. 


Notes to Section 9.2. The section is based mainly on Giné (1997), an update (as 
regards measurability) of the fundamental theorems of Giné and Zinn (1990). 
Lemma 9.8 and Theorem 9.9 are based on analogous results of Hoffmann- 
Jørgensen (1974). The proofs also include some new elements. Giné (1997) 
gives a lot of references, not reproduced here, on different parts of the proof. 
Some special cases of the Giné—Zinn theorems were published earlier by Bickel 
and Freedman (1981), Singh (1981), and Gaenssler (1986) among others. These 
papers also give some results other than special cases of the Giné—Zinn theo- 
rems, on bootstrap asymptotics for sample quantiles and other functionals. 


Notes to Section 9.3. Efron (1979) discovered the bootstrap. Three books on 
it are Hall (1992), Efron and Tibshirani (1993), and Shao and Tu (1995). 
Uniform and Universal Limit Theorems


In this chapter we look at cases where a class F of measurable functions is a 
Glivenko—Cantelli or Donsker class for all probability laws P on the underlying 
space and ask if so, whether the convergence is uniform in P. 


10.1 Uniform Glivenko—Cantelli Classes 


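Let $(X, \mathcal A)$ be a measurable space. Let $\mathcal P(X) = \mathcal P(X, \mathcal A)$ be the set of all probability measures on $(X, \mathcal A)$. Let $\mathcal F$ be a class of measurable real-valued functions on $X$. Then $\mathcal F$ is called a strong uniform Glivenko–Cantelli class if for every $\varepsilon > 0$ there is an $n_0$ such that

$$\sup_{P\in\mathcal P(X,\mathcal A)} \Pr\bigl(\|P_n - P\|_{\mathcal F} > \varepsilon \text{ for some } n \ge n_0\bigr) < \varepsilon.$$

A class $\mathcal F$ will be called a uniform Glivenko–Cantelli class in probability if for every $\varepsilon > 0$ there is an $n_0$ such that

$$\Pr\bigl(\|P_n - P\|_{\mathcal F} > \varepsilon\bigr) < \varepsilon$$

for all $n \ge n_0$ and all $P \in \mathcal P(X, \mathcal A)$.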
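
As an illustration of how $n_0$ can be chosen independently of $P$, consider the special case (an assumption made here for the example) where $X = \mathbb R$ and $\mathcal F$ is the class of indicators of half-lines $(-\infty, t]$. Massart's form of the Dvoretzky–Kiefer–Wolfowitz inequality gives $\Pr(\|P_n - P\|_{\mathcal F} > \varepsilon) \le 2\exp(-2n\varepsilon^2)$ for every law $P$, so it suffices to take $n_0$ with $2\exp(-2n_0\varepsilon^2) < \varepsilon$. The helper name `dkw_n0` is hypothetical.

```python
import math

def dkw_bound(n, eps):
    """Massart's DKW bound on Pr(sup_t |F_n(t) - F(t)| > eps), valid for every law P on R."""
    return 2.0 * math.exp(-2.0 * n * eps * eps)

def dkw_n0(eps):
    """Smallest n with 2*exp(-2*n*eps^2) < eps, i.e. n > log(2/eps) / (2*eps^2)."""
    return math.floor(math.log(2.0 / eps) / (2.0 * eps * eps)) + 1

n0 = dkw_n0(0.1)
```

Since the bound is decreasing in $n$, the inequality then holds for all $n \ge n_0$ and all $P$ at once, which is exactly the uniformity required in the second definition for this class.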


Proposition 10.1 Let $(\Omega, \mathcal B, P)$ be a probability space and $(X, \|\cdot\|)$ a normed space. Let $X_1, X_2, \ldots$ be $X$-valued functions on $\Omega$ such that for each $n$ and $s_j = 1$ or $-1$ for each $j = 1,\ldots,n$, $\|\sum_{j=1}^n s_jX_j\|$ and $\|X_n\|$ are measurable, the joint distribution of $\|\sum_{i=1}^j s_iX_i\|$ for $j = 1,\ldots,n$ does not depend on $s_1,\ldots,s_n$, and $\|X_j\|$ for $j = 1,\ldots,n$ are i.i.d. Let $S_n := \sum_{j=1}^n X_j$. If $\|S_n\|/n \to 0$ in probability as $n \to \infty$, then $n\Pr(\|X_1\| > n) \to 0$.
Proof. Let $S_0 := 0$. For any $n$ and $j = 1, 2, \ldots, n$, if $\|X_j\| > n$, then we cannot have $\|S_i\| \le n/2$ for both $i = j - 1$ and $j$. Thus

$$\Pr\Bigl(\max_{j\le n}\|X_j\| > n\Bigr) \le \Pr\Bigl(\max_{j\le n}\|S_j\|/n > \tfrac12\Bigr). \tag{10.1}$$

Lemma 3.12 is a form of the P. Lévy inequality. Its hypotheses per se do not apply, but the conclusion holds with about the same proof. We do not need stars since the norms here are measurable. It gives that the right side of (10.1) is $\le 2\Pr(\|S_n\|/n > 1/2) \to 0$ as $n \to \infty$. Then by (10.1), for $p_n := \Pr(\|X_1\| > n)$, $(1 - p_n)^n \to 1$, which implies $np_n \to 0$.
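
The final implication in the proof can be written out in two lines. This is a routine filling-in of the step, using only that the norms $\|X_j\|$ are i.i.d. and the elementary bound $\log(1-p) \le -p$:

```latex
% Step 1: the norms \|X_j\|, j \le n, are i.i.d., so by (10.1) and the Levy-type bound
\Pr\Bigl(\max_{j\le n}\|X_j\| \le n\Bigr) \;=\; (1-p_n)^n \;\longrightarrow\; 1 .
% Step 2: since \log(1-p) \le -p for 0 \le p < 1 (and p_n < 1 for n large),
0 \;\le\; n\,p_n \;\le\; -n\log(1-p_n) \;=\; -\log\bigl((1-p_n)^n\bigr) \;\longrightarrow\; 0 .
```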


Proposition 10.2 For any measurable space $(X, \mathcal A)$ and class $\mathcal F$ of real-valued measurable functions on it, if $\mathcal F$ is a Glivenko–Cantelli class for every $P$ on $(X, \mathcal A)$, then $\sup_{f\in\mathcal F}(\sup f - \inf f) < +\infty$.


Proof. Suppose not. Then there are $f_k \in \mathcal F$ and points $x_k$ and $y_k$ such that $f_k(x_k) - f_k(y_k) > 8^k$ for all positive integers $k$. Let $P := \sum_{k=1}^\infty (\delta_{x_k} + \delta_{y_k})/2^{k+1}$. Then $P$ is a probability measure (law) defined on all subsets of $X$ and so on $\mathcal A$, and $\mathcal F$ is Glivenko–Cantelli for $P$. The countable subset $\mathcal G = \{f_k\}_{k\ge 1}$ is also Glivenko–Cantelli for $P$. Norms $\|\cdot\|_{\mathcal G}$ in what follows will all be measurable because $\mathcal G$ is countable. Let $P_n'$ be an independent copy of $P_n$. Then $\|P_n - P_n'\|_{\mathcal G} \to 0$ in probability. Here $P_n - P_n' = \frac1n\sum_{i=1}^n V_i$ where $V_i = \delta_{X(i)} - \delta_{Y(i)}$ and $X(1),\ldots,X(n), Y(1),\ldots,Y(n)$ are i.i.d. $(P)$. Thus the $V_i$ are symmetric, i.e., any $V_i$ can be interchanged with $-V_i$ without changing the joint distribution of the $\|\cdot\|_{\mathcal G}$ seminorms of any partial sums. It follows from Proposition 10.1 that $n\Pr(\|V_1\|_{\mathcal G} > n) \to 0$ as $n \to \infty$. Then by definition of $P$, for each $k$, and $n = 8^k$,

$$n\Pr(\|V_1\|_{\mathcal G} > n) \ge n\Pr(X(1) = x_k \text{ and } Y(1) = y_k) \ge 8^k/4^{k+1} \to \infty$$

as $k \to \infty$, a contradiction, proving the proposition.


If $\mathcal F$ is a class of bounded functions, let $\mathcal F_0 := \{f - \inf f:\ f \in \mathcal F\}$. The following is immediate:


Proposition 10.3 Let $(X, \mathcal A)$ be a measurable space and $\mathcal F$ be a family of bounded measurable real-valued functions on $X$. For any $f \in \mathcal F$, constant $c = c_f$, in particular $c_f = \inf f$, and any two probability measures $P$ and $Q$ on $\mathcal A$, $(P - Q)(f) = (P - Q)(f - c_f)$. Thus $\|P - Q\|_{\mathcal F} = \|P - Q\|_{\mathcal F_0}$. This holds in particular if $Q$ is any $P_n$.


We have the following equivalence: 


Theorem 10.4 For any separable measurable space $(X, \mathcal A)$ and class $\mathcal F$ of measurable real-valued functions on $X$ such that $\mathcal F_0$ is image admissible Suslin, $\mathcal F$ is a uniform Glivenko–Cantelli class in probability if and only if it is a strong uniform Glivenko–Cantelli class. If $\mathcal F$ is such a class, then

$$\lim_{n\to\infty}\,\sup_P\, E\|P_n - P\|_{\mathcal F} = 0. \tag{10.2}$$


Proof. "If" is immediate. For "only if," by Proposition 10.2, $\mathcal F_0$ is uniformly bounded. By Proposition 10.3, we can assume that $\mathcal F = \mathcal F_0$ and that $\|f\|_{\sup} \le 1$ for all $f \in \mathcal F$. Then always $\|P_n - P\|_{\mathcal F} \le 1$. By Theorem 6.6(b), for any $P$ on $(X, \mathcal A)$, $(\|P_n - P\|_{\mathcal F}, \mathcal S_n)$ is a reversed submartingale with $\mathcal S_n$ the smallest $\sigma$-algebra for which all $P_k(f)$ for $k \ge n$ and $f \in \mathcal L^1(X, \mathcal A, P)$ are measurable. By assumption, for any $\delta > 0$ there is an $n_0 = n_0(\delta)$ not depending on $P$ such that for all $n \ge n_0$, $\Pr(\|P_n - P\|_{\mathcal F} > \delta) < \delta$. Given $\varepsilon > 0$, let $\delta = \varepsilon^2/2$. For any $N > n_0(\delta)$ let $Y_j = \|P_{N-j} - P\|_{\mathcal F}$ for $j = 1,\ldots,N - n_0$. Then the $Y_j$ form a submartingale for the $\sigma$-algebras $\mathcal S_{N-j}$. Let $A(\varepsilon, k) = \{\max_{1\le j\le k} Y_j \ge \varepsilon\}$. By Doob's maximal inequality (e.g., RAP, Theorem 10.4.2) with $k = N - n_0$,

$$\varepsilon\Pr(A(\varepsilon, k)) \le E\|P_{n_0} - P\|_{\mathcal F} \le 2\delta = \varepsilon^2. \tag{10.3}$$

Thus $\Pr(A(\varepsilon, k)) \le \varepsilon$, and this holds for all $N > n_0$. Letting $N \to \infty$ we have

$$\Pr\Bigl(\sup_{n\ge n_0}\|P_n - P\|_{\mathcal F} > \varepsilon\Bigr) \le \varepsilon,$$

showing that $\mathcal F$ is a strong uniform Glivenko–Cantelli class, and by (10.3) for any $n \ge n_0(\delta)$ in place of $n_0$, (10.2) also holds.


Proposition 10.5 Let $(X, \mathcal A)$ be a measurable space and $\mathcal F$ a uniformly bounded class of measurable functions, totally bounded for $d_{\sup}$. Then $\mathcal F$ is a strong uniform Glivenko–Cantelli class.


Proof. Given $\varepsilon > 0$, suppose that functions $f_1,\ldots,f_m$ form an $\varepsilon/4$-net in $\mathcal F$ for $d_{\sup}$. For each $f \in \mathcal F$ there is an $f_j$ such that $d_{\sup}(f, f_j) \le \varepsilon/4$. It follows that for any $n$, $|(P_n - P)(f - f_j)| \le \varepsilon/2$. Thus if we can find an $n_0$ such that $|(P_n - P)(f_j)| < \varepsilon/2$ for all $j$ and all $n \ge n_0$, then $|(P_n - P)(f)| < \varepsilon$ for all $n \ge n_0$. So it suffices to show that a finite set of bounded measurable functions is a strong uniform Glivenko–Cantelli class, as it clearly is by the strong law of large numbers.


If a class $\mathcal C$ of measurable sets is a uniform Glivenko–Cantelli class, it must be a VC class by Assouad's theorem 6.27. Conversely, uniformly bounded classes $\mathcal F$ of functions satisfying conditions of Vapnik–Červonenkis type and suitable measurability conditions will be shown to be uniformly Glivenko–Cantelli.

Let $(X, \mathcal A)$ be a measurable space and $\mathcal F$ a family of bounded measurable functions on $X$. For $x = (x_1,\ldots,x_n) \in X^n$, $n = 1, 2, \ldots$, $1 \le p < \infty$, and any $f, g \in \mathcal F_0$, define the pseudometric

$$e_{x,p}(f, g) := \Bigl(\frac1n\sum_{j=1}^n |f(x_j) - g(x_j)|^p\Bigr)^{1/p}.$$

Then $D(\varepsilon, \mathcal F_0, e_{x,p}) \le +\infty$ is defined. Let

$$H_{n,p}(\varepsilon, \mathcal F_0) := \sup_{x\in X^n}\log D(\varepsilon, \mathcal F_0, e_{x,p}).$$
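
For a finite list of functions restricted to the sample points, these quantities can be explored numerically. The sketch below assumes that $D(\varepsilon, \cdot, d)$ denotes the packing number from earlier in the book, i.e. the largest number of elements that are pairwise more than $\varepsilon$ apart; the greedy routine only produces a maximal $\varepsilon$-separated set, whose size is a lower bound for $D$. The function names are hypothetical.

```python
def e_xp(f_vals, g_vals, p):
    """Empirical L^p pseudometric e_{x,p}(f, g), given the values of f and g at x_1..x_n."""
    n = len(f_vals)
    return (sum(abs(a - b) ** p for a, b in zip(f_vals, g_vals)) / n) ** (1.0 / p)

def greedy_separated(functions_on_x, eps, p):
    """Greedily pick functions that are pairwise more than eps apart for e_{x,p}.

    The result is a maximal eps-separated subset of the input list, so its
    length is a lower bound for the packing number D(eps, ., e_{x,p}).
    """
    chosen = []
    for f in functions_on_x:
        if all(e_xp(f, g, p) > eps for g in chosen):
            chosen.append(f)
    return chosen

# Illustration: indicators of half-lines restricted to n = 10 ordered points.
# The k-th vector is 1 on the first k points and 0 on the rest.
n = 10
half_lines = [[1.0] * k + [0.0] * (n - k) for k in range(n + 1)]
separated = greedy_separated(half_lines, eps=0.25, p=1)
```

For this class $e_{x,1}$ between the $k$-th and $l$-th vectors equals $|k - l|/n$, so the size of a separated set stays bounded as $n$ grows for fixed $\varepsilon$, in line with $H_{n,p}(\varepsilon, \mathcal F_0)/n \to 0$ for classes of Vapnik–Červonenkis type.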


We have: 


Theorem 10.6 Let $\mathcal F$ be a family of bounded functions on $X$ for a separable measurable space $(X, \mathcal A)$ such that $\mathcal F_0$ is image admissible Suslin. Then the following are equivalent:


(a) $\mathcal F$ is a uniform Glivenko–Cantelli class in probability;
(b) $\mathcal F$ is a strong uniform Glivenko–Cantelli class;
For $1 \le p < \infty$,


($c_p$) $\mathcal F_0$ is uniformly bounded, and for all $\varepsilon > 0$, $\lim_{n\to\infty} H_{n,p}(\varepsilon, \mathcal F_0)/n = 0$.


Proof. Statements (a) and (b) are equivalent by Theorem 10.4, and (a) implies 
that Fo is uniformly bounded by Proposition 10.2. By Proposition 10.3, each 
of (a) and (b) holds for Fo if and only if it holds for F. So we can and will 
assume F is uniformly bounded in the rest of the proof. 

Next it will be shown that (c1) implies (b). Let P be any probability measure 
on (X, A). Let {ez}x>1 be iid. Rademacher variables independent of {X j}j>1. 
As in the definitions before Proposition 6.4, assume given as there random 
variables ø (i) and t(i) independent of each other and the X ;. Specifically, take 
a countable product X% of copies of (X, A, P) on which X; are coordinates, 
take another probability space Q’ on which the o(i) and t(i) are defined, 
and take yet another Q, on which ii.d. eg are defined. Take the products 
Q := X” x Q, and Q” := Q x Q with product probabilities. For P! and P% 
as defined before Proposition 6.4 we have P/ = + Da ôx, for X}, = Xoj) 


and likewise P; = + X’; dy; for Y; = X1(j). Thus 


1 n 
Pa- Py = -9 0x- ôr). (10.4) 
j=l 
Now ||P, — P||z and ||P; — P’’||- are universally measurable since F = Fo is 


image-admissible Suslin, by Proposition 5.20 and Corollary 5.25. The expres- 
sion in (10.4) has all the properties of 


1 n 
Va := — (ôx. — dy’), 10.5 
2 x, — ôy!) (10.5) 
specifically, their $\|\cdot\|_{\mathcal F}$ norms are equal in distribution, because replacing any particular $\varepsilon_j$ by $-\varepsilon_j$ is equivalent to interchanging $\sigma(j)$ and $\tau(j)$. Both operations preserve probabilities.

Given e > 0, letn > 16/e?. In Lemma 6.5 take ¢ := 1 and n = y/ne/2 > 2. 
Then the Lemma gives 


Pr(||Pn— PIF >e) < 2 PrP; — Pilly > €/2) =2Pr(lVallz > €/2. (10.6) 


For € € X” let x := x(&) := x,(&) := (X1 (8), ..., X2n(€)). By definition of 
D(é, Fo, éx,1), for each € € X” there is a map z, = xn from Fo onto a sub- 
set Gg of cardinality at most D(e/8, Fo, ex,¢),1) such that for all f € Fo, 
ex), (f, Inf) < €/8. 

Let Fo be image admissible Suslin via (S, S, T). To show that the choice of 
G, can be made measurably, we have that (x, s, t) œ> ex (T (s), T(£)) is jointly 


measurable. To construct G; := {f Te recursively, let f be any fixed fı € 


Fo. Suppose given SE fF with eal fP >e/8 fr l<i<j<r. 
Let AË := {f € Fo: exif, fi) > ¢/8} forl <i <r. If AË = Ø, then k(é) = 
r < D(e/8, Fo, ex,1), and the recursion is finished. If Ag Æ Ø, then {(€, s) € 
Xx S$: f=T(s)E€ AÈ} is an A% x S measurable subset of X x S. Let 
B, := {&: Ag + Ø}. Then by measurable selection (Theorem 5.22), B, is 
universally measurable, and there is a universally measurable map ¢, from B, 
into S such that Taa := T(¢,(€)) € A, (£) for all € € B,. 
We have 


1 


< / If — 7n fld(P! + Pl) < 6/8, 


1 
5 | / F = mi AAVal <5 


and therefore 


Pr(||Vallz > €/2) < Pr VaGtn( PF > €/4). 


The latter norm is a supremum over a finite set of functions for fixed £ and 
then is a measurable function of ws := {€k}k>1. Let Pr, denote probability 
with respect to ws for € fixed. Then by the Hoeffding inequality for linear 
combinations of Rademacher functions, Proposition 1.12, for each &, 


Pre (|Van (PIF > €/4) < 2D(E/8, F, ex,¢),1) exp(—ne? /(B2)). 


By (c1), for all n large enough, D(£/8, F, ex1) < exp(ne?/(64)) for all possible 
x = x,(&). For such n, taking the expectation with respect to £, where we have 
joint measurability in (€, œs) by the image admissible Suslin condition and 
measurable choice of J: 


Pr((|Vallz > €/2) < 2exp(—ne?/(64)). 
Then by (10.6), summing a geometric series gives that for each $\varepsilon > 0$ there is an $n_\varepsilon$ and a $C = C(\varepsilon) = 4/[1 - \exp(-\varepsilon^2/64)]$ such that

$$\sup_P \sum_{k\ge n_\varepsilon} \Pr\bigl(\|P_k - P\|_{\mathcal F} > \varepsilon\bigr) \le C\exp(-n_\varepsilon\,\varepsilon^2/64) < \varepsilon$$

where the supremum is over all probability measures $P$ on $(X, \mathcal A)$, which implies (b).
Next it will be shown that (a) implies (c2). The following is used: 


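Lemma 10.7 If $\mathcal F$ satisfies (a) and $P$ is such that $\mathcal F \subset \mathcal L^p(P)$, $p \ge 1$, and $Pf = 0$ for all $f \in \mathcal F$, and $\varepsilon_i$ are i.i.d. Rademacher variables independent of the variables $X_i$ (in the stronger product space sense), then, for any functionals $a_i = a_i(f)$,

$$2^{-p}\,E^*\Bigl\|\sum_{i=1}^n \varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal F}^p \;\le\; E^*\Bigl\|\sum_{i=1}^n \delta_{X_i}\Bigr\|_{\mathcal F}^p \;\le\; 2^{p}\,E^*\Bigl\|\sum_{i=1}^n (\varepsilon_i\delta_{X_i} + a_i)\Bigr\|_{\mathcal F}^p. \tag{10.7}$$


Proof. The subscript $\mathcal F$ will be dropped from the norms for simplicity. By Proposition 10.2, $\mathcal F_0$ is uniformly bounded. For each $f \in \mathcal F$, $Pf = 0$ implies that $\sup f \ge 0$ and $\inf f \le 0$. Thus $\mathcal F$ is uniformly bounded. Let $A$ and $B$ be disjoint sets of indices, let $E_A$ denote integration with respect to $X_i$, $i \in A$, and likewise for $E_B$. In the following display, the first two equal E*’s can be written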
as E% by Lemma 3.6. The first inequality holds by the Jensen inequality in the 
form of Lemma 9.7(b), and the last by the 1-sided Tonelli—Fubini theorem, 
Theorem 3.9, so we have 


P Pp 
ES #X%)| =E Y D+) D 
icA icA iceB 
i i (10.8) 
<E Y A eE f(D 
icAUB 1€AUB 


If Ey denotes E with respect to {X;} for {e;} fixed, and E, expectation with 
respect to {e;}, then 


Dafa 


by Theorem 3.9 again, now using discreteness of {e;}. If U and V are random 


E* 


P 
"= BEX] D- YD ræ 0.9) 
i:e;=1 1 


i:g;=— 


elements of a normed space (Y, ||'||) and 1 < p < œ, then by convexity 


Pp 


U-V 
2 


Let A and B be the two sets of i’s on the right side of (10.9). Let 
U := YViea f(Xi) and V := } jeg f(Xi). Using (10.8), also with A and B 


1 
=5 (Ui? + V7). 
interchanged, gives using again the discreteness of $\{\varepsilon_i\}$


reo) DeO) 


proving the first inequality in (10.7). For the second inequality, similarly, again 
applying the Jensen inequality Lemma 9.7, 


DaD 
i=1 


P P 
E* U-V]? <2? E* |U +V ||? <2? E, E% = 2? E* 


PpP 
= E* 


Pp 


VOGA) — Ef Xni) 


i=1 


E* 


p 


VCXO + aD — DOF Xn) + aif) 


i=l i=l 


< E* 


’ 


which for any {¢;}, because P™ is invariant under permutations of the coordi- 
nates, equals 


P 


Daf AD + aN f Xni) — aif) 


i=1 


Ex 


Pp 
<2? E* 


’ 


Yo af XD + ai( f)) 


i=1 


as in the first half of the proof. 


Now to continue the proof that (a) implies ($c_2$), recall $\Lambda_{2,1}$ and Lemma 9.16 about it. In (10.10) below $\Lambda_{2,1}$ of a random variable appears in an upper bound, which is useless when it is $+\infty$. Thus Lemma 9.16(b) gives a sufficient condition, namely $E(|\xi|^r) < \infty$ for some $r > 2$, for the bound to be finite.


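Theorem 10.8 Let $\mathcal F$ be an image admissible Suslin class of $P$-integrable functions on $X$ for a separable measurable space $(X, \mathcal A)$. Let $X_i$ be $X$-valued random variables, and let $\varepsilon_i$, $\xi_i$, $i \in \mathbb N$, be respectively a Rademacher sequence and a sequence of symmetric i.i.d. real random variables, independent of each other, all coordinates on a product probability space $X^\infty \times \{-1, 1\}^\infty \times \mathbb R^\infty$, in particular, all independent. Then, for every $0 \le n_0 < \infty$ and $n_0 < n \in \mathbb N$, we have

$$\begin{aligned}
(E|\xi_1|)\,E\Bigl\|\frac{1}{\sqrt n}\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}
&\le E\Bigl\|\frac{1}{\sqrt n}\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F} \\
&\le n_0\,\bigl(E\|f(X_1)\|_{\mathcal F}\bigr)\,E\Bigl[\frac{1}{\sqrt n}\max_{i\le n}|\xi_i|\Bigr] \\
&\quad + \Lambda_{2,1}(\xi_1)\,\max_{n_0<k\le n} E\Bigl\|\frac{1}{\sqrt k}\sum_{i=n_0+1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}.
\end{aligned} \tag{10.10}$$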
Proof. To begin, the first inequality in (10.10) will be proved. For any $n = 1, 2, \ldots$, the class $\mathcal F_n$ of functions

$$\bigl((x_i, \varepsilon_i, \xi_i)\bigr)_{i=1}^n \mapsto \sum_{i=1}^n \varepsilon_i|\xi_i| f(x_i)$$

for $f \in \mathcal F$ is image admissible Suslin on $X^n \times \mathbb R^n \times \mathbb R^n$ (by Theorem 5.26), and likewise if each $\varepsilon_i|\xi_i|$ is replaced by $\xi_i$. By symmetry, the joint distribution of the $\xi_i$ equals that of the $\varepsilon_i|\xi_i|$, so that

$$E\Bigl\|\frac{1}{\sqrt n}\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F} = E\Bigl\|\frac{1}{\sqrt n}\sum_{i=1}^n \varepsilon_i|\xi_i| f(X_i)\Bigr\|_{\mathcal F} \tag{10.11}$$

where the norms are universally measurable by Corollary 5.25, so the expectations exist. For fixed $\{\varepsilon_i\}$, $\{X_i\}$, and $f$, $\sum_{i=1}^n \varepsilon_i|\xi_i| f(X_i)$ is a linear function of $\eta_n := \{|\xi_i|\}_{i=1}^n$ and so its absolute value is a convex function of $\eta_n$. Thus $\|\sum_{i=1}^n \varepsilon_i|\xi_i| f(X_i)\|_{\mathcal F}$, as the supremum of a family of convex functions, is a convex function of $\eta_n$. It follows then by Jensen's inequality that the equal expressions in (10.11) are

$$\ge E\Bigl\|\frac{1}{\sqrt n}\sum_{i=1}^n (E|\xi_i|)\,\varepsilon_i f(X_i)\Bigr\|_{\mathcal F},$$

which proves the first inequality. Now consider the second. From here on $\mathcal F$ is omitted from the norm signs. Let $N_t := \#\{i \le n:\ |\xi_i| > t\}$. The expressions in (10.11) are also equal to


E 


maS (/ Lid) ei f(Xi) 
i=1 9 


=E Jn” È jet lr<lélEi FX») dt | 


<f E [MPE a rse f X| at 


=f E 


ny Mes | dt 


< fo" (Cha PrN = HE | Dh FX) 


Jar <T+UV 
where


’ 


T := (JẸ PrN; > O}dt) max ren E [a ke f(X) 


1 fe <= 
U=— XO Vk Pr{N, = k}dt, 


k=no+ 1 


I k 
V := max E T cif (XD. 


no<k<n 
i=not+l 


We have T < (Jọ Pr {max;<, l&i] > t} dt) noE|| f(X1)//nll and 


oo n 


1 
U < — VKP(N, = k)dt. 
<f 3 r( )dt 


Let |i], ..., |En] in order (order statistics) be |E la) < lEl@ < < lEla, with 
|& |) := 0. Then N, is the number of values of i < n with |&|() > t, and N, = k 
if and only if |E |(,-¢41) = t > Ela- Now 


oo [0.6] 
jl Pr(lE ln- < t < lEla-+)dt = / E é\q-m<t<lE leary dt 
0 0 


lEln—e-+1) 
=E f dt. 
IE lar 


Next, 


n 


lE lake) lEliny 
D2) Vkdt = ef JN, dt, 


k=1 ” lEla-» 


and EVN, < VEN, = (E Dia lg) l = (Prigi > 9), so U< 
i Pr(|&1| > t)!/2dt = Ax (E,). Theorem 10.8 now follows. 


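To continue further the proof that (a) implies ($c_2$), for a given $P$, and $\mathcal F$ and $\mathcal F_0$ as in Theorem 10.6, let $\mathcal G := \{f - Pf:\ f \in \mathcal F\}$. Since clearly $Pf \in [\inf f, \sup f]$ for each $f \in \mathcal F$, by Proposition 10.2, $\mathcal G$ is uniformly bounded. We have clearly $\mathcal G_0 = \mathcal F_0$, so $\mathcal G_0$ is image admissible Suslin. By Proposition 10.3, $\mathcal G_0$ and $\mathcal G$ are uniform Glivenko–Cantelli classes, so (a) holds for $\mathcal G$. For the given $P$, clearly $\mathcal G \subset \mathcal L^1(P)$, and $Pg = 0$ for all $g \in \mathcal G$, so the hypothesis of Lemma 10.7 holds for $\mathcal G$ in place of $\mathcal F$ and so also the conclusion. For $p = 1$ this gives for $m = 1, 2, \ldots$,

$$\begin{aligned}
\frac12\,E^*\Bigl\|\sum_{i=1}^m \varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal G} &\le E^*\Bigl\|\sum_{i=1}^m \delta_{X_i}\Bigr\|_{\mathcal G} = E^*\|mP_m\|_{\mathcal G} = m\,E^*\|P_m - P\|_{\mathcal F} \\
&= m\,E\|P_m - P\|_{\mathcal F_0} = o(m)
\end{aligned} \tag{10.12}$$

as $m \to \infty$, uniformly in $P \in \mathcal P(X, \mathcal A)$, by Theorem 10.4.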
To continue the proof that (a) implies (c2), the following Gaussianization 
lemma will be used: 


Lemma 10.9 Under the hypotheses of Theorem 10.6 and (a), if $g_1, g_2, \ldots$ are i.i.d. $N(0, 1)$, then

$$\lim_{n\to\infty}\,\sup_{P\in\mathcal P(X,\mathcal A)} E\Bigl\|\frac1n\sum_{j=1}^n g_j\delta_{X_j}\Bigr\|_{\mathcal F_0} = 0. \tag{10.13}$$

Proof. Apply Theorem 10.8 in case $\xi_i = g_i$, for which clearly $\Lambda_{2,1}(g_1) < \infty$. Of the three expressions in the two inequalities (10.10), we want to show that the middle expression $E\|\sum_{i=1}^n g_i f(X_i)\|_{\mathcal F}/\sqrt n$, divided by a further $\sqrt n$, is $o(1)$ uniformly in $P \in \mathcal P(X, \mathcal A)$. Let the two terms added in the last expression in (10.10) with $\xi_i = g_i$ be $T_1 + T_2$. So we want to show that $(T_1 + T_2)/\sqrt n \to 0$. In $T_1$, since $\mathcal F$ is uniformly bounded by some $M$, $E\|f(X_1)\|_{\mathcal F} \le M$. By Proposition 2.5, for any $n \ge 2$ and $C > 0$,

$$\Pr\Bigl(\max_{j\le n}|g_j| > C\sqrt{\log n}\Bigr) \le n\exp(-(C^2\log n)/2) = n^{1 - C^2/2}.$$

It follows that

$$\sup_{n\ge 2} E\Bigl(\max_{j\le n}|g_j|/\sqrt{\log n}\Bigr) \le 2 + \sup_{n\ge 2}\sum_{k=2}^\infty n^{1 - k^2/2} < \infty,$$

so $E(\max_{j\le n}|g_j|) = O(\sqrt{\log n})$ as $n \to \infty$. Thus $T_1/\sqrt n = O(n_0\sqrt{\log n}/n)$. To make this $o(1)$, we are free to choose $n_0 = n_0(n) < n$ in (10.10), and we need $n_0 = o(n/\sqrt{\log n})$, so choose $n_0$ to be asymptotic to $n/(\log n)$.

For T», and each value of k with nọ < k < n, we have form := k — nọ < k 


X ef (Xi) 
A i=l 


which is o(m) and o(k) uniformly in P € P(X, A)asn — oo by (10.12). When 
divided by Vk it is o(V/k) = o(./n), so when further divided by \/n it is o(1), 
uniformly in P € P(X, A), as desired, and the Lemma is proved. 


k 
E| afa] =E 


i=nọ+1 


’ 


Fo 
Now continuing the proof that (a) implies ($c_2$), let $x = (x_1,\ldots,x_n) \in X^n$ and let $P_n$ be the corresponding empirical measure $P_n := \frac1n\sum_{i=1}^n \delta_{x_i}$. For $j = 1,\ldots,n$ let $m(j)$ be i.i.d. random variables uniformly distributed over $\{1, 2, \ldots, n\}$ and independent of $\{g_j\}_{j=1}^n$, defined on another product space factor $\{1, 2, \ldots, n\}^n$. Then $x_{m(j)}$ for $j = 1,\ldots,n$ are i.i.d. $P_n$ (as in bootstrap sampling, Chapter 9). Applying (10.13) to $P_n$ gives

$$\lim_{n\to\infty}\,\sup_{x\in X^n} E\Bigl\|\frac1n\sum_{j=1}^n g_j\delta_{x_{m(j)}}\Bigr\|_{\mathcal F_0} = 0. \tag{10.14}$$

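The following claim will be proved:

$$E\Bigl\|\frac1n\sum_{i=1}^n g_i\delta_{x_i}\Bigr\|_{\mathcal F_0} \le (1 - e^{-1})^{-1}\,E\Bigl\|\frac1n\sum_{j=1}^n g_j\delta_{x_{m(j)}}\Bigr\|_{\mathcal F_0}. \tag{10.15}$$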

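To prove this let $A_{ij} := \{m(j) = i\}$ for $i, j = 1,\ldots,n$. Then for each $j = 1,\ldots,n$, the sets $A_{ij}$ are disjoint for distinct $i$, each with probability $\Pr(A_{ij}) = 1/n$, and $\bigcup_{i=1}^n A_{ij}$ is the whole probability space. Sets $A_{i_1,1}, A_{i_2,2}, \ldots, A_{i_n,n}$ are jointly independent for any $i_1,\ldots,i_n$. Let $g_{ij}$ for $i, j = 1,\ldots,n$ be $n^2$ i.i.d. $N(0, 1)$ random variables. By disjointness and independence, the two $n \times n$ arrays of random variables

$$\{g_j 1_{A_{ij}}\}_{i,j=1}^n \quad\text{and}\quad \{g_{ij} 1_{A_{ij}}\}_{i,j=1}^n$$

have the same joint distribution on $\mathbb R^{n^2}$. It follows that

$$E\Bigl\|\frac1n\sum_{j=1}^n g_j\delta_{x_{m(j)}}\Bigr\|_{\mathcal F_0} = E\Bigl\|\frac1n\sum_{j=1}^n g_j\Bigl(\sum_{i=1}^n 1_{A_{ij}}\delta_{x_i}\Bigr)\Bigr\|_{\mathcal F_0} = E\Bigl\|\frac1n\sum_{i,j=1}^n g_{ij} 1_{A_{ij}}\delta_{x_i}\Bigr\|_{\mathcal F_0}. \tag{10.16}$$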

Conditionally on the events $\{A_{ij}\}$, the random variables $\sum_{j=1}^n g_{ij} 1_{A_{ij}}$ for $i = 1,\ldots,n$ are independent with distribution $N(0, \sum_{j=1}^n 1_{A_{ij}})$, in other words, equal in distribution to $\bigl\{\bigl(\sum_{j=1}^n 1_{A_{ij}}\bigr)^{1/2} g_i\bigr\}_{i=1}^n$.


Here is a subclaim: for each $i$, $E\sqrt{S_i} \ge 1 - e^{-1}$ where $S_i := \sum_{j=1}^n 1_{A_{ij}}$. To prove the subclaim, the $1_{A_{ij}}$ for fixed $i$ are i.i.d. Bernoulli $(1/n)$ random variables. We have $\sqrt{S_i} \ge 1_{B_i}$ where $B_i := \bigcup_{j=1}^n A_{ij}$. Thus

$$E\sqrt{S_i} \ge \Pr(B_i) = 1 - \Bigl(1 - \frac1n\Bigr)^n \ge 1 - e^{-1}.$$

So the subclaim is proved.
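
The subclaim is easy to check numerically. In the sketch below, `prob_hit(n)` is the probability that index $i$ is drawn at least once in $n$ uniform draws, and `expected_sqrt_count(n)` evaluates $E\sqrt{S_i}$ exactly from the binomial$(n, 1/n)$ distribution of $S_i$; both function names are hypothetical.

```python
import math

def prob_hit(n):
    """Pr(B_i) = 1 - (1 - 1/n)^n: index i is drawn at least once in n uniform draws."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def expected_sqrt_count(n):
    """E sqrt(S_i), where S_i is binomial(n, 1/n), computed from the exact pmf."""
    p = 1.0 / n
    return sum(
        math.sqrt(k) * math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        for k in range(1, n + 1)
    )

lower = 1.0 - math.exp(-1.0)
checks = [(n, expected_sqrt_count(n), prob_hit(n)) for n in (1, 2, 5, 20, 100)]
```

Each triple in `checks` should satisfy $E\sqrt{S_i} \ge \Pr(B_i) \ge 1 - e^{-1}$, with $\Pr(B_i)$ decreasing to $1 - e^{-1}$ as $n$ grows.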
We then have

$$\begin{aligned}
E\Bigl\|\frac1n\sum_{i,j=1}^n g_{ij} 1_{A_{ij}}\delta_{x_i}\Bigr\|_{\mathcal F_0}
&= E\Bigl\|\frac1n\sum_{i=1}^n\Bigl(\sum_{j=1}^n 1_{A_{ij}}\Bigr)^{1/2} g_i\delta_{x_i}\Bigr\|_{\mathcal F_0} \\
&\ge (1 - e^{-1})\,E\Bigl\|\frac1n\sum_{i=1}^n g_i\delta_{x_i}\Bigr\|_{\mathcal F_0},
\end{aligned}$$

as follows. The first equality results from writing $E = E_{m(\cdot)}E_{(g)}$, where $E_{m(\cdot)}$ is expectation with respect to the distribution of $m(\cdot)$, a finite sum, and $E_{(g)}$ is conditional expectation given $\{m(j)\}_{j=1}^n$, using the conditional distribution given before the subclaim. Then, since $\{g_i\}_{i=1}^n$ is independent of $m(\cdot)$, we can reverse the order of integration in the second expectation, where now $E_{(g)}$ is expectation with respect to the unconditional joint distribution of the $g_i$, namely, i.i.d. $N(0, 1)$. For fixed $\{g_i\}$ we then have by the Jensen inequality, Lemma 9.7(a), that $E_{m(\cdot)}\|\cdots\| \ge \|E_{m(\cdot)}\cdots\|$, and then applying the subclaim, we get the last inequality. Then using (10.16), the claim (10.15) is proved. It follows from (10.14) that

$$\lim_{n\to\infty}\,\sup_{x\in X^n} E\Bigl\|\frac1n\sum_{i=1}^n g_i\delta_{x_i}\Bigr\|_{\mathcal F_0} = 0. \tag{10.17}$$

For $f \in \mathcal F_0$ let $X(f) := \frac1n\sum_{i=1}^n g_i f(x_i)$, a Gaussian process indexed by $\mathcal F_0$. For $h \in \mathcal F_0$ also we have $e_{x,2}(f, h) = \sqrt n\,d_X(f, h)$. Thus for $\varepsilon > 0$, $D(\varepsilon, \mathcal F_0, e_{x,2}) = D(\varepsilon/\sqrt n, \mathcal F_0, d_X)$. By the Sudakov minoration, in the form of Theorem 2.22(b), we have

$$E\Bigl\|\frac1n\sum_{i=1}^n g_i\delta_{x_i}\Bigr\|_{\mathcal F_0} \ge \frac{1}{17}\,\sup_{\varepsilon>0}\ \varepsilon\,\bigl(\log D(\varepsilon, \mathcal F_0, e_{x,2})\bigr)^{1/2}/n^{1/2}.$$

Then ($c_2$) follows from this and (10.17).

Lastly, it will be shown that the conditions ($c_p$) for $1 \le p < \infty$ are all equivalent. Take $M < \infty$ such that $\|f\|_{\sup} \le M$ for all $f \in \mathcal F_0$. For any $\{x_j\}_{j=1}^n$, for the probability measure $P_n := \frac1n\sum_{j=1}^n \delta_{x_j}$ and $1 \le r < \infty$, $e_{x,r}(f, g) = \|f - g\|_{r,n}$, the $L^r$ norm for $P_n$, which is nondecreasing in $r$ by Hölder's inequality. Let $1 \le p < q < \infty$. Then for any $\varepsilon > 0$, $H_{n,p}(\varepsilon, \mathcal F_0) \le H_{n,q}(\varepsilon, \mathcal F_0)$, so ($c_q$) implies ($c_p$). Conversely, $e_{x,q}(f, g) \le e_{x,p}(f, g)^{p/q}(2M)^{(q-p)/q}$, which implies $H_{n,q}(\varepsilon, \mathcal F_0) \le H_{n,p}(\varepsilon^{q/p}/(2M)^{(q-p)/p}, \mathcal F_0)$. So ($c_p$) implies ($c_q$).
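
The comparison between the two pseudometrics used in the converse direction can be derived in a few lines. The derivation below uses only that functions in $\mathcal F_0$ are bounded in absolute value by $M$, so that $|f - g| \le 2M$ pointwise:

```latex
% For f, g in F_0 and 1 <= p < q < infinity, with |f - g| <= 2M pointwise:
e_{x,q}(f,g)^q \;=\; \frac1n\sum_{j=1}^n |f(x_j)-g(x_j)|^{q}
  \;\le\; (2M)^{q-p}\,\frac1n\sum_{j=1}^n |f(x_j)-g(x_j)|^{p}
  \;=\; (2M)^{q-p}\, e_{x,p}(f,g)^p .
% Taking q-th roots:
e_{x,q}(f,g) \;\le\; e_{x,p}(f,g)^{p/q}\,(2M)^{(q-p)/q} .
% Hence e_{x,q}(f,g) > \varepsilon forces
e_{x,p}(f,g) \;>\; \varepsilon^{q/p}\big/(2M)^{(q-p)/p} ,
% so any set that is \varepsilon-separated for e_{x,q} is
% \varepsilon^{q/p}/(2M)^{(q-p)/p}-separated for e_{x,p}, which gives the bound on H_{n,q}.
```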


10.2 Universal Donsker Classes 


Let X be a set and A a o -algebra of subsets of X. Then a class F of measurable 
functions on X will be called a universal Donsker class if it is a P-Donsker 
class for every probability measure P on (X, A). 

Recall that every universal Donsker class of sets is a Vapnik-Červonenkis 
class (Theorem 6.24), and the converse holds (Corollary 6.20) under the usual 
image admissible Suslin measurability condition. 

For a real-valued function $f$ let $\mathrm{diam}(f) := \sup f - \inf f$. The following
shows that a universal Donsker class is uniformly bounded up to additive
constants:


Proposition 10.10 If $\mathcal{F}$ is a universal Donsker class, then $\sup_{f\in\mathcal{F}} \mathrm{diam}(f) < \infty$.


Proof. Suppose not. Then take $x_k \in X$, $y_k \in X$ and $f_k \in \mathcal{F}$ so that $f_k(x_k) -
f_k(y_k) > 2^k$ for $k = 1, 2, \ldots$. Let

$$P := \sum_{k=1}^{\infty} (\delta_{x_k} + \delta_{y_k})/2^{k+1}.$$

Then $P$ is a probability measure defined on all subsets of $X$ and so on $\mathcal{A}$. For
$P$,

$$E(f_k - Ef_k)^2 \ \ge\ 2^{-k-1}\inf_c\,\{(f_k(x_k) - c)^2 + (f_k(y_k) - c)^2\}.$$

The infimum is attained when $c = (f_k(x_k) + f_k(y_k))/2$, so

$$E(f_k - Ef_k)^2 \ \ge\ 2^{-k-2}(f_k(x_k) - f_k(y_k))^2 > 2^{k-2}$$

for all $k = 1, 2, \ldots$, so $\mathcal{F}$ is unbounded in the $\rho_P$ metric and hence not a
Donsker class for $P$ by Theorem 3.34, and not a universal Donsker class.
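
As a sanity check on the variance bound in this proof, here is a small exact computation for one hypothetical choice of the functions: $f_k$ equal to $2^k + 1$ at $x_k$ and $0$ everywhere else, so that $f_k(x_k) - f_k(y_k) > 2^k$. This choice, and the function name `variance_of_fk`, are assumptions made only for illustration.

```python
from fractions import Fraction

def variance_of_fk(k):
    """Exact Var_P(f_k) when f_k = 2^k + 1 at x_k and 0 elsewhere (hypothetical f_k)."""
    q = Fraction(1, 2 ** (k + 1))    # P({x_k}) = 2^{-(k+1)} under the law P of the proof
    a = Fraction(2 ** k + 1)         # f_k(x_k) - f_k(y_k) = 2^k + 1 > 2^k
    mean = q * a                     # E f_k
    second_moment = q * a * a        # E f_k^2
    return second_moment - mean * mean

# The proof's lower bound Var_P(f_k) > 2^{k-2} holds for each k tried.
bounds_hold = all(variance_of_fk(k) > Fraction(2) ** (k - 2) for k in range(1, 11))
print(bounds_hold)
```

Since the variances grow without bound in $k$, the class is unbounded for $\rho_P$, which is the contradiction the proof uses.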


Proposition 10.11 For any family $\mathcal{F}$ of measurable functions on $(X, \mathcal{A})$, any
probability measure $P$ on $(X, \mathcal{A})$, any constants $c_f$ depending on $f \in \mathcal{F}$,
$\mathcal{G} := \{f - c_f : f \in \mathcal{F}\}$, and $\mathcal{H} := \mathcal{F} + \mathbb{R} := \{f + c : f \in \mathcal{F},\ c \in \mathbb{R}\}$, each
of the following properties holds for all three of $\mathcal{F}$, $\mathcal{G}$, and $\mathcal{H}$ if it holds for any
one of them:

(a) Donsker for $P$;

(b) Universal Donsker;

(c) Glivenko-Cantelli for $P$;

(d) Glivenko-Cantelli for all $P$;

(e) Uniform Glivenko-Cantelli.


Proof. Since $P_n - P$ is linear and $(P_n - P)(c) = 0$, (c), (d), and (e) are immediate.
In Section 3.1 a $G_P$ process on $\mathcal{F}$ was called coherent if each sample
function $G_P(\cdot)(\omega)$ is prelinear, bounded, and uniformly continuous on $\mathcal{F}$ with
respect to $\rho_P$. Here $\mathcal{F}$ is $P$-pregaussian if and only if a coherent $G_P$ process
on it exists (Theorem 3.2). For (a), a coherent $G_P$ process has $G_P(0) = 0$ and
can be extended to make $G_P(f + c) = G_P(f)$ for all $f \in \mathcal{F}$ and all $c$, remaining
coherent. The total boundedness for $\rho_P$ and asymptotic equicontinuity are
equivalent for $\mathcal{F}$, $\mathcal{G}$, and $\mathcal{H}$. It follows by Theorem 3.34 that (a) holds, and (b)
follows.


Remark. If F is a class of bounded functions, in Proposition 10.11 we can take 
c(f) := inf f. Then G is a class of nonnegative functions. If F is universal 
Donsker, then by Proposition 10.10 G is uniformly bounded. 


The Vapnik-Červonenkis properties of classes of functions treated in Sections
4.7 and 4.8 (VC subgraph, VC major, VC hull) all have been (for VC hull
and major in Theorem 6.21) or will be seen to imply the universal Donsker
property for uniformly bounded classes of functions under some measurability
conditions. So the relations among these different VC properties are of interest
here. Recall (Section 4.8) that $D^{(2)}(\varepsilon, \mathcal{F}, Q)$ is the largest $m$ such that for some
$f_1, \ldots, f_m \in \mathcal{F}$, $\int |f_i - f_j|^2\, dQ > \varepsilon^2$ for all $i \ne j$. Also, $D^{(2)}(\varepsilon, \mathcal{F})$ is the
supremum over all laws $Q$ with finite support of $D^{(2)}(\varepsilon, \mathcal{F}, Q)$.

The following is a continuation of Theorem 4.53. 


Proposition 10.12 There exist uniformly bounded VC major (thus VC hull) 
classes which do not satisfy (4.12), thus are not VC subgraph classes. 


Proof. A uniformly bounded VC major class is VC hull by Theorem 4.51(b).
Let $\mathcal{F}$ be the set of all right-continuous nonincreasing functions $f$ on $\mathbb{R}$ with
$0 \le f \le 1$. Then since the class $\mathcal{C}$ of open or closed half-lines $(-\infty, x)$ or
$(-\infty, x]$ is a VC class (with $S(\mathcal{C}) = 1$), $\mathcal{F}$ is a VC major class. It is rather easy
to see (Chapter 4, Problem 10) that $\mathcal{F}$ is not a VC subgraph class. The interest
here is in showing that (4.12) fails.

For any $f$, $g$ (in $\mathcal{F}$) and any law $Q$, $(\int |f - g|^2\, dQ)^{1/2} \ge \int |f - g|\, dQ$. Thus

$$D^{(2)}(\varepsilon, \mathcal{F}, Q) \ \ge\ D^{(1)}(\varepsilon, \mathcal{F}, Q) \quad \text{for any } \varepsilon > 0.$$

Let $P$ be Lebesgue measure on $[0, 1]$. Then $D^{(1)}(\varepsilon, \mathcal{F}, P) = D(\varepsilon, LL_{2,1}, d_\lambda)$
where $LL_{2,1}$ is the set of all lower layers (defined in Section 8.3) in the unit
square $I^2$ in $\mathbb{R}^2$, and $d_\lambda(A, B) := \lambda(A \Delta B)$. For some $c > 0$, $D(\varepsilon, LL_{2,1}, d_\lambda) \ge
e^{c/\varepsilon}$ as $\varepsilon \downarrow 0$ by Theorem 8.22. For each $\varepsilon > 0$ small enough, by the law of large
numbers, there is a law $Q$ with finite support and $D^{(1)}(\varepsilon, \mathcal{F}, Q) \ge e^{c/\varepsilon} - 1$, so
(4.12) fails and by Theorem 4.53, $\mathcal{F}$ is not a VC subgraph class.


Let $\mathcal{F}$ be a uniformly bounded class of measurable functions, so that some
constant $M$ with $0 < M < \infty$ is an envelope for $\mathcal{F}$. Limit-theorem properties
of $\mathcal{F}$ are equivalent to those of $\mathcal{G} := \{f/M : f \in \mathcal{F}\}$, so we can assume that
$M = 1$. Then Pollard's entropy condition as in Theorem 6.15 becomes

$$\int_0^1 \bigl(\log D^{(2)}(\varepsilon, \mathcal{F})\bigr)^{1/2}\, d\varepsilon < \infty. \tag{10.18}$$


Theorem 10.13 If $\mathcal{F}$ is a uniformly bounded, image admissible Suslin class
of measurable functions and satisfies (10.18), then $\mathcal{F}$ is a universal Donsker
class.


Proof. $\mathcal{F}$ has a finite constant $C$ as an envelope function. For a constant envelope,
the hypotheses of Theorem 6.15 do not depend on the law $P$, so Theorem
10.13 is a corollary of Theorem 6.15. Here are some more details. Let $\mathcal{F}/C :=
\{f/C : f \in \mathcal{F}\}$, so that $\mathcal{F}/C$ has as an envelope the constant 1. It will be enough
to show that $\mathcal{F}/C$ is a universal Donsker class. Then for $\delta > 0$, $D^{(2)}(\delta, \mathcal{F}/C) =
D^{(2)}_1(\delta, \mathcal{F}/C)$ as in Theorem 6.15. Make the substitution $\delta = \varepsilon/C$ and note that
$D^{(2)}(\varepsilon/C, \mathcal{F}/C) = D^{(2)}(\varepsilon, \mathcal{F})$. It follows that $\mathcal{F}/C$ satisfies (10.18). So Theorem
6.15 applies, and $\mathcal{F}/C$ and $\mathcal{F}$ are universal Donsker classes.


Corollary 10.14 If a VC subgraph class $\mathcal{F}$ of measurable functions is uniformly
bounded and image admissible Suslin, then it is a universal Donsker
class.


Proof. This follows from Theorems 4.53 and 10.13. 


Specializing further, the set of indicators of an image-admissible Suslin VC 
class of sets is a universal Donsker class (Corollary 6.19 for F = 1). 

For a class $\mathcal{F}$ of real-valued functions on a set $X$, recall from Section
4.7 the class $H(\mathcal{F}, M)$ which is $M$ times the symmetric convex hull of $\mathcal{F}$,
and $H_s(\mathcal{F}, M)$ which is the closure of $H(\mathcal{F}, M)$ for sequential pointwise
convergence. Note that for any uniformly bounded class $\mathcal{F}$ of measurable
functions for a $\sigma$-algebra $\mathcal{A}$ and any law $Q$ defined on $\mathcal{A}$, $H(\mathcal{F}, M)$ is dense in
$H_s(\mathcal{F}, M)$ for the $L^2(Q)$ distance (or any $L^p(Q)$ distance, $1 \le p < \infty$).


Theorem 10.15 If $\mathcal{F}$ is a universal Donsker class of measurable real-valued
functions on a measurable space $(X, \mathcal{A})$, then for any $M < \infty$, $H_s(\mathcal{F}, M)$ is a
universal Donsker class.


Proof. Let $\mathcal{G} := \{f - \inf f : f \in \mathcal{F}\}$, as in the Remark after Proposition
10.11. Then functions in $H(\mathcal{F}, M)$ differ from functions in $H(\mathcal{G}, M)$ by additive
constants. If $h_k \in H_s(\mathcal{F}, M)$, $h_k \to h$ pointwise, and for all $k$, $h_k = \phi_k + c_k$
for some $\phi_k \in H_s(\mathcal{G}, M)$ and constants $c_k$, then since $\phi_k$ are uniformly bounded
and $h$ has finite values, $c_k$ are bounded. So, taking a subsequence, we can assume
$c_k$ converges to some $c$. Then $\phi_k$ converge pointwise to some $\phi \in H_s(\mathcal{G}, M)$
and $h = \phi + c$. So all functions in $H_s(\mathcal{F}, M)$ differ by additive constants from
functions in $H_s(\mathcal{G}, M)$ (the converse may not hold). Thus, if $H_s(\mathcal{G}, M)$ is a
universal Donsker class, so is $H_s(\mathcal{F}, M)$ by Proposition 10.11(b). So we can
assume $\mathcal{F}$ is uniformly bounded. Then it has an envelope function in $L^2(P)$ for
all $P$, so by Theorem 3.41, $H_s(\mathcal{F}, M)$ is a universal Donsker class.


Remark. We already saw in Theorem 6.21 that if $\mathcal{F}$ is a uniformly bounded VC
major class for a VC class $\mathcal{C}$ of sets, such that $\mathcal{C}$ is image admissible Suslin,
then $\mathcal{F}$ is a universal Donsker class.


By Theorem 10.13 above, for any $\delta > 0$, if $\log D^{(2)}(\varepsilon, \mathcal{F}) = O(1/\varepsilon^{2-\delta})$ as
$\varepsilon \downarrow 0$, and if $\mathcal{F}$ is image admissible Suslin, then $\mathcal{F}$ is a universal Donsker class.
In the converse direction we have:


Theorem 10.16 For a uniformly bounded class $\mathcal{F}$ to be a universal Donsker
class it is necessary that

$$\log D^{(2)}(\varepsilon, \mathcal{F}) = O(\varepsilon^{-2}) \quad \text{as } \varepsilon \downarrow 0.$$


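Proof. Suppose not. Then there are a universal Donsker class $\mathcal{F}$ and $\varepsilon_k \downarrow 0$ such
that $\log D^{(2)}(\varepsilon_k, \mathcal{F}) > k^3/\varepsilon_k^2$ for $k = 1, 2, \ldots$, so there are probability laws $P_k$
with finite support for which $\log D^{(2)}(\varepsilon_k, \mathcal{F}, P_k) > k^3/\varepsilon_k^2$ for $k = 2, 3, \ldots$. Let
$P$ be a law with $P \ge \sum_{k\ge 2} P_k/k^2$. Then for any measurable $f$ and $g$,

$$\Bigl(\int (f - g)^2\, dP\Bigr)^{1/2} \ \ge\ \Bigl(\int (f - g)^2\, dP_k\Bigr)^{1/2}/k.$$

Let $\delta_k := \varepsilon_k/k$. Then

$$\log D^{(2)}(\delta_k, \mathcal{F}, P) \ \ge\ \log D^{(2)}(\varepsilon_k, \mathcal{F}, P_k) > k^3/\varepsilon_k^2 = k/\delta_k^2.$$

So any isonormal process $L$ on $L^2(P)$ is a.s. unbounded on $\mathcal{F}$ by Theorem 2.14.
We can write $L(f) = G_P(f) + G\int f\, dP$ where $G$ is a standard normal variable
independent of $G_P$. Since $\mathcal{F}$ is uniformly bounded, $G_P$ is a.s. unbounded on
(a countable $\rho_P$-dense subset of) $\mathcal{F}$, so $\mathcal{F}$ is not $P$-pregaussian, and so not a
$P$-Donsker class.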


Theorem 10.16 is optimal, as the following shows:


Proposition 10.17 There exists a universal Donsker class $\mathcal{E}$ such that

$$\liminf_{\delta\downarrow 0}\ \delta^2 \log D^{(2)}(\delta, \mathcal{E}) > 0.$$


Proof. Let $A_j := A(j)$ be disjoint, nonempty measurable sets for $j = 1, 2, \ldots$.
Let $\|\cdot\|_2$ be the $\ell^2$ norm, $\|x\|_2 = (\sum_j x_j^2)^{1/2}$ for $x = \{x_j\}_{j=1}^{\infty}$. Let

$$\mathcal{E} := \Bigl\{\sum_j x_j 1_{A(j)} :\ \|x\|_2 \le 1\Bigr\}.$$

(So $\mathcal{E}$ is an ellipsoid with center 0 and semiaxes $1_{A(j)}$.) Let $P$ be any probability
measure defined on the $A_j$ and let $p_j := P(A_j)$ for $j = 1, 2, \ldots$. We can
assume that $p_j > 0$ for all $j$, since if $B$ is the union of all $A_j$ such that $p_j = 0$,
then $P(B) = 0$, $G_P(B) = 0$ a.s. and $\nu_n(B) = 0$ a.s. for all $n$.

Let $\varepsilon > 0$. For any $k$ and $n$, let $\|\nu_n\|_{2,k} := (\sum_{j\ge k} \nu_n(A_j)^2)^{1/2}$. Then for
all $n$,

$$E\|\nu_n\|_{2,k}^2 \ \le\ \sum_{j=k}^{\infty} p_j \to 0 \quad \text{as } k \to \infty.$$

Take $k = k(\varepsilon)$ large enough so that $\sum_{j\ge k} p_j < \varepsilon^3/18$. Then

$$\Pr\{\|\nu_n\|_{2,k} > \varepsilon/3\} < \varepsilon/2. \tag{10.19}$$

If $\|\nu_n\|_{2,k} \le \varepsilon/3$, $\|x\|_2 \le 1$ and $\|y\|_2 \le 1$, then by the Cauchy(-Schwarz)
inequality,

$$\Bigl|\nu_n\Bigl(\sum_{j\ge k} (x_j - y_j) 1_{A(j)}\Bigr)\Bigr| \ \le\ 2\varepsilon/3. \tag{10.20}$$

Also, $E\|\nu_n\|_{2,1} \le (E\|\nu_n\|_{2,1}^2)^{1/2} \le 1$, so

$$\Pr\{\|\nu_n\|_{2,1} > 2/\varepsilon\} < \varepsilon/2. \tag{10.21}$$

Let $\delta := (\min_{j<k} p_j)^{1/2}\varepsilon^2/6 > 0$. Let $f_x := \sum_{j=1}^{\infty} x_j 1_{A(j)}$ for each $x$
with $\|x\|_2 \le 1$, so $f_x \in \mathcal{E}$. If $f_x$ and $f_y \in \mathcal{E}$ and $e_P(f_x, f_y) := (\int (f_x -
f_y)^2\, dP)^{1/2} < \delta$, then

$$\Bigl(\sum_{j<k} (x_j - y_j)^2\Bigr)^{1/2} < \varepsilon^2/6. \tag{10.22}$$

By (10.21) and Cauchy's inequality, $|\sum_{j<k} (x_j - y_j)\nu_n(A_j)| \le \varepsilon/3$ for all $x$
and $y$ such that $e_P(f_x, f_y) < \delta$, except on an event with probability $< \varepsilon/2$.
Thus by (10.19) and (10.20), there is an event $F$ with $\Pr(F) < \varepsilon$ such that if
$\omega \notin F$ then

$$\text{for all } x \text{ and } y \text{ with } e_P(f_x, f_y) < \delta, \quad |\nu_n(f_x - f_y)| \le \varepsilon. \tag{10.23}$$

Given $x$ with $\sum_j x_j^2 \le 1$, let $y_j = x_j$ for $j < k$ and $y_j = 0$ for $j \ge k$. Then
$(\int (f_x - f_y)^2\, dP)^{1/2} < \varepsilon/4$. Since $x \mapsto f_x$ is continuous from $\ell^2$ into $L^2(P)$,
the set of all $f_x$ such that $\sum_j x_j^2 \le 1$ and $x_j = 0$ for $j \ge k$ is compact in $L^2(P)$.
It follows that $\mathcal{E}$ is totally bounded in $L^2(P)$. Thus by (10.23) and Theorem
3.34 for $\tau = e_P$, $\mathcal{E}$ is a Donsker class for $P$. Since $P$ was arbitrary, $\mathcal{E}$ is a
universal Donsker class.

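Next, given $0 < \delta < 1/2$, let $m := [1/(4\delta^2)]$, where $[x]$ is the largest integer
$\le x$. Let $P$ be a probability measure with $P(A_j) = 1/m$ for $j = 1, \ldots, m$. Then
in $L^2(P)$, $\mathcal{E}$ is an $m$-dimensional ball with radius $m^{-1/2}$. Let $r := D^{(2)}(\delta, \mathcal{E}, P)$
and let $g_1, \ldots, g_r \in \mathcal{E}$ with $e_P(g_i, g_j) > \delta$ for $1 \le i < j \le r$. Let $B_i := \{f_x :
e_P(f_x, g_i) < 2\delta\}$. Then

$$\bigcup_i B_i \ \supset\ \Bigl\{f_x :\ \Bigl(\sum_{j=1}^m x_j^2/m\Bigr)^{1/2} < m^{-1/2} + \delta,\ x_j = 0 \text{ for } j > m\Bigr\}.$$

Thus by comparing volumes of balls in $\mathbb{R}^m$, we get by choice of $m$,

$$D^{(2)}(\delta, \mathcal{E}, P) = r \ \ge\ (m^{-1/2} + \delta)^m/(2\delta)^m = \Bigl(\frac{1}{2} + \frac{1}{2\delta m^{1/2}}\Bigr)^m \ \ge\ (3/2)^m \ \ge\ \frac{2}{3}\exp\Bigl(\log\Bigl(\frac{3}{2}\Bigr)\Big/(4\delta^2)\Bigr).$$

Letting $\delta \downarrow 0$, Proposition 10.17 follows.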
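
The chain of inequalities in the volume comparison can be checked numerically on a logarithmic scale. This is a hedged sketch only: the function name `log_bounds`, the tolerance, and the sample values of $\delta$ are illustrative assumptions.

```python
import math

def log_bounds(delta):
    """Logs of the three successive lower bounds for r = D^(2)(delta, E, P)."""
    m = math.floor(1.0 / (4.0 * delta ** 2))                   # m = [1/(4 delta^2)]
    log_ratio = m * math.log((m ** -0.5 + delta) / (2.0 * delta))
    log_mid = m * math.log(1.5)                                # log of (3/2)^m
    log_final = math.log(2.0 / 3.0) + math.log(1.5) / (4.0 * delta ** 2)
    return m, log_ratio, log_mid, log_final

TOL = 1e-9                  # slack for floating-point rounding (assumption)
chain_ok = True
for delta in (0.45, 0.3, 0.2, 0.07, 0.03):
    m, log_ratio, log_mid, log_final = log_bounds(delta)
    chain_ok = chain_ok and log_ratio >= log_mid - TOL and log_mid >= log_final - TOL
print(chain_ok)
```

Multiplying the last bound by $\delta^2$ gives $\delta^2\log(2/3) + \log(3/2)/4$, which tends to $\log(3/2)/4 > 0$ as $\delta \downarrow 0$, in line with the proposition.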


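Proposition 10.18 There is a uniformly bounded class $\mathcal{F}$ of measurable functions,
which is not a universal Donsker class, such that

$$\log D^{(2)}(\varepsilon, \mathcal{F}) \ \le\ \frac{2}{\varepsilon^2\log(1/\varepsilon)} \quad \text{as } \varepsilon \downarrow 0.$$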


Proof. Let $B_j := B(j)$ be disjoint nonempty measurable sets. Recall that
$Lx := \max(1, \log x)$. Let $a_j := 1/(jLj)^{1/2}$, $j \ge 1$, and

$$\mathcal{F} := \Bigl\{\sum_{j=1}^{\infty} x_j 1_{B(j)} :\ x_j = \pm a_j \text{ for all } j\Bigr\}.$$

Take $c$ such that $\sum_{j=1}^{\infty} p_j = 1$, where $p_j := c(a_j/LLj)^2$. Here $\sum_j p_j < \infty$
by the integral test since

$$(d/dx)(1/LLx) = -1/(xLx(LLx)^2) \quad \text{for } x > e^e.$$

Take a probability measure $P$ with $P(B_j) = p_j$ for all $j$. Let $\alpha := (1 -
p_1)^{1/2}/2 > 0$. Then

$$E\|G_P\|_{\mathcal{F}} = \sum_{j=1}^{\infty} a_j E|G_P(B_j)| = \sum_{j=1}^{\infty} a_j (2/\pi)^{1/2}(p_j(1 - p_j))^{1/2} \ \ge\ \alpha\sum_{j=1}^{\infty} a_j p_j^{1/2} = \alpha c^{1/2}\sum_{j=1}^{\infty} 1/(jLjLLj) = +\infty$$

by the integral test since $(d/dx)(LLLx) = 1/(xLxLLx)$ for $x$ large enough. If
$\mathcal{F}$ were $P$-pregaussian, then since $G_P$ can be treated as an isonormal process
(see Section 3.1), by Theorem 2.32 ((a) if and only if (h)) and the material
just before it, $G_P$ could be realized on a separable Banach space with norm
$\|\cdot\|_{\mathcal{F}}$. Then the norm would have finite expectation by the Landau-Shepp-Marcus-Fernique
theorem (Theorem 2.6 above), a contradiction. So $\mathcal{F}$ is not
pregaussian for $P$ and so not a universal Donsker class.
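For any probability measure $Q$, $r = 1, 2, \ldots$,

$$f = \sum_j x_j 1_{B(j)} \in \mathcal{F} \quad \text{and} \quad g = \sum_j y_j 1_{B(j)} \in \mathcal{F},$$

if $x_j = y_j$ for $1 \le j < r$, then

$$e_Q(f, g) = \Bigl(\int \Bigl(\sum_{j\ge r} (x_j - y_j) 1_{B(j)}\Bigr)^2 dQ\Bigr)^{1/2} \ \le\ \Bigl(\sum_{j\ge r} 4a_j^2 Q(B_j)\Bigr)^{1/2} \ \le\ 2a_r. \tag{10.24}$$

Given $\varepsilon > 0$, let $r(\varepsilon)$ be the smallest integer $r \ge 1$ such that $a_r \le \varepsilon/2$. By
(10.24), if $x_j = y_j$ for $1 \le j < r$, then $e_Q(f, g) \le \varepsilon$. Thus since there are only
$2^{r-1}$ possibilities for $x_j = \pm a_j$, $j < r$, we have $D^{(2)}(\varepsilon, \mathcal{F}, Q) \le 2^{r-1}$, and
since $r$ does not depend on $Q$, $D^{(2)}(\varepsilon, \mathcal{F}) \le 2^{r-1}$ and $\log D^{(2)}(\varepsilon, \mathcal{F}) < r\log 2$.
As $\varepsilon \downarrow 0$, we have $a_{r(\varepsilon)} \sim \varepsilon/2$, so

$$\log(1/\varepsilon) \sim \log(2/\varepsilon) \sim \log(1/a_{r(\varepsilon)}) \sim \frac{1}{2}\log(r(\varepsilon)),$$

and $\varepsilon/2 \sim 1/(r(\varepsilon)\cdot 2\log(1/\varepsilon))^{1/2}$, so $r(\varepsilon) \sim 2/(\varepsilon^2\log(1/\varepsilon))$. Since

$$\log D^{(2)}(\varepsilon, \mathcal{F}) < r(\varepsilon)\log 2 < r(\varepsilon),$$

the conclusion follows.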


Theorems 10.13 and 10.16 show that the condition (10.18) comes close
to characterizing the universal Donsker property, but Propositions 10.17 and
10.18 show that there is no characterization of the universal Donsker property
in terms of $D^{(2)}$.


10.3 Metric Entropy of Convex Hulls in Hilbert Space 
Let $H$ be a real Hilbert space and for any subset $B$ of $H$ let $\mathrm{co}(B)$ be its convex
hull,

$$\mathrm{co}(B) := \Bigl\{\sum_{j=1}^k t_j x_j :\ t_j \ge 0,\ \sum_{j=1}^k t_j = 1,\ x_j \in B,\ k = 1, 2, \ldots\Bigr\}.$$

Recall that $D(\varepsilon, B)$ is the maximum number of points in $B$ more than $\varepsilon$ apart.


Theorem 10.19 Suppose that $B$ is an infinite subset of a Hilbert space $H$,
$\|x\| \le 1$ for all $x \in B$ and that for some $K < \infty$ and $0 < \gamma < \infty$, we have
$D(\varepsilon, B) \le K\varepsilon^{-\gamma}$ for $0 < \varepsilon \le 1$. Let $s := 2\gamma/(2 + \gamma)$. Then for any $t > s$,
there are constants $C_1$ and $C_2$, which depend only on $K$, $\gamma$ and $t$, such that

$$D(\varepsilon, \mathrm{co}(B)) \ \le\ C_1\exp(C_2\varepsilon^{-t}) \quad \text{for } 0 < \varepsilon \le 1.$$


Note. Both van der Vaart and Wellner (1996, Theorem 2.6.9) and Carl (1997)
give the sharper bound with $t = s$.


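Proof. We may assume $K \ge 1$. Choose any $x_1 \in B$. Let $n \ge 2$ and suppose
given $B(n) := \{x_1, \ldots, x_{n-1}\}$. Let $d(x, B(n)) := \min_{y\in B(n)} \|x - y\|$ and
$\delta_n := \sup_{x\in B} d(x, B(n))$. Since $B$ is infinite, $\delta_n > 0$ for all $n$. Choose $x_n \in B$
with $d(x_n, B(n)) > \delta_n/2$. Then for all $n$, $K(2/\delta_n)^{\gamma} \ge D(\delta_n/2, B) \ge n$, so
$\delta_n \le Mn^{-1/\gamma}$ for all $n$ where $M := 2K^{1/\gamma}$.

Let $0 < \varepsilon \le 1$. Let $N := N(\varepsilon)$ be the next integer larger than $(4M/\varepsilon)^{\gamma}$. Then
$\delta_N \le \varepsilon/4$. Let $G := B(N)$. For each $x \in B$ there is an $i = i(x) \le N - 1$ with
$\|x - x_i\| \le \delta_N$. For any convex combination $z = \sum_{x\in B} z_x x$ where $z_x \ge 0$ and
$\sum_{x\in B} z_x = 1$, with $z_x = 0$ except for finitely many $x$, let $z_N := \sum_{x\in B} z_x x_{i(x)}$.
Then $\|z - z_N\| \le \delta_N \le \varepsilon/4$, so

$$D(\varepsilon, \mathrm{co}(B)) \ \le\ D(\varepsilon/2, \mathrm{co}(G)). \tag{10.25}$$

To bound $D(\varepsilon/2, \mathrm{co}(G))$, let $m := m(\varepsilon)$ be the largest integer $\le \varepsilon^{-s}$. Note that
$\gamma > s$. Then for each $i$ with $m < i < N$, there is a $j \le m$ such that

$$\|x_i - x_j\| \ \le\ \delta_{m+1} \ \le\ M\varepsilon^{s/\gamma}. \tag{10.26}$$

Let $j(i)$ be the least such $j$. Let $\Lambda_m := \{\{\lambda_j\}_{1\le j\le m} :\ \lambda_j \ge 0,\ \sum_{j=1}^m \lambda_j = 1\}$.
On $\mathbb{R}^m$ we have the $\ell_p$ metrics

$$\rho_p(\{x_j\}, \{y_j\}) := \Bigl(\sum_{j=1}^m |x_j - y_j|^p\Bigr)^{1/p}.$$

By Cauchy's inequality, $\rho_1 \le m^{1/2}\rho_2$. Let $\beta := \varepsilon/6$ and $\delta := \beta/(2m^{1/2})$. The
$\delta$-neighborhood of $\Lambda_m$ for $\rho_2$ is included in a ball of radius $1 + \delta \le 13/12$. We
have $D(2\delta, \Lambda_m, \rho_2)$ centers of disjoint balls of radius $\delta$ included in the neighborhood.
Comparing volumes of balls and recalling that $Lx := \max(1, \log x)$
gives

$$\begin{aligned}
D(\beta, \Lambda_m, \rho_1) \ \le\ D(2\delta, \Lambda_m, \rho_2) &\le 13^m m^{m/2}\varepsilon^{-m} \\
&\le \exp(m\{L(1/\varepsilon) + (Lm)/2 + \log(13)\}) \\
&\le \exp(C_3\varepsilon^{-s}L(1/\varepsilon)) \ \le\ \exp(C_4\varepsilon^{-t}), \quad 0 < \varepsilon \le 1,
\end{aligned} \tag{10.27}$$

for some constants $C_3$, $C_4$.

For each $j = 1, \ldots, m$, let $A_j := A(j)$ consist of $x_j$ and the set of all
$x_i$, $i = m + 1, \ldots, N - 1$, such that $j(i) = j$. Then $G = \bigcup_j A_j$ and the $A_j$ are
disjoint. Take a maximal set $S = S(\varepsilon) \subset \Lambda_m$ with $\rho_1(u, v) > \beta$ for any $u \ne v$
in $S$. For a given $\lambda = (\lambda_1, \ldots, \lambda_m) \in S$, let

$$F_\lambda := \Bigl\{x \in \mathrm{co}(G) :\ x = \sum_{y\in G} \mu_{y,x}\, y \ \text{ where } \mu_{y,x} \ge 0 \text{ and } \sum_{y\in A(j)} \mu_{y,x} = \lambda_j \text{ for all } j = 1, \ldots, m\Bigr\}.$$

For any $x \in \mathrm{co}(G)$, let $x = \sum_{y\in G} \mu_{y,x}\, y$ where $\mu_{y,x} \ge 0$ and for $\tau_j(x) :=
\sum_{y\in A(j)} \mu_{y,x}$, $\tau(x) := \{\tau_j(x)\}_{j=1}^m \in \Lambda_m$. Take $\lambda(x) \in S$ such that $\sum_j |\lambda_j(x) -
\tau_j(x)| \le \beta$. For $y \in A(j)$, define $\nu_{y,x} := \lambda_j(x)\mu_{y,x}/\tau_j(x)$ if $\tau_j(x) > 0$. If
$\tau_j(x) = 0$, choose a $w \in A(j)$ and define $\nu_{w,x} := \lambda_j(x)$ and $\nu_{y,x} := 0$ for
$y \ne w$, $y \in A(j)$. If $\tau_j(x) > 0$, then

$$\sum_{y\in A(j)} |\mu_{y,x} - \nu_{y,x}| = \sum_{y\in A(j)} \mu_{y,x}\,|1 - \lambda_j(x)/\tau_j(x)| = |\tau_j(x) - \lambda_j(x)|.$$

If $\tau_j(x) = 0$, then $\sum_{y\in A(j)} |\mu_{y,x} - \nu_{y,x}| = \nu_{w,x} = \lambda_j(x) = |\tau_j(x) - \lambda_j(x)|$. It
follows that $\sum_{y\in G} |\mu_{y,x} - \nu_{y,x}| = \rho_1(\tau(x), \lambda(x)) \le \beta$. Set $z_x :=
\sum_{y\in G} \nu_{y,x}\, y$. Then $z_x \in F_{\lambda(x)}$, $\lambda(x) \in S$, and $\|z_x - x\| \le \beta$. Thus by (10.25)
and (10.27),

$$\begin{aligned}
D(\varepsilon, \mathrm{co}(B)) \ \le\ D(\varepsilon/2, \mathrm{co}(G)) &\le D\Bigl(\beta, \bigcup_{\lambda\in S} F_\lambda\Bigr) \\
&\le \mathrm{card}(S)\max_{\lambda\in S} D(\beta, F_\lambda) \ \le\ \exp(C_4\varepsilon^{-t})\max_{\lambda\in S} D(\beta, F_\lambda).
\end{aligned} \tag{10.28}$$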

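To estimate the latter factor, let $\lambda \in S$. We may assume $\lambda_j > 0$ for all
$j$. For any $j = 1, \ldots, m$ and $x \in F_\lambda$, let $x^{(j)} := \sum_{y\in A(j)} \mu_{y,x}\, y$. Let $Y_j$ be
a random variable with values in $A(j)$ and $P(Y_j = y) = \mu_{y,x}/\lambda_j$ for each
$y \in A(j)$. Then $EY_j = x^{(j)}/\lambda_j =: z_j$. Take $Y_1, \ldots, Y_m$ to be independent and
let $Y := \sum_{j=1}^m \lambda_j Y_j$. Then $EY = x$ and

$$E\|Y - x\|^2 = E\Bigl\|\sum_{j=1}^m \lambda_j(Y_j - z_j)\Bigr\|^2 = \sum_{j=1}^m \lambda_j^2 E\|Y_j - z_j\|^2,$$

since $Y_j - z_j$ are independent and have mean 0, and $H$ is a Hilbert space.
Now the diameter of $A(j)$ is at most $2M\varepsilon^{s/\gamma}$ by (10.26), and $z_j$ is a convex
combination of elements of $A(j)$. Thus

$$E\|Y_j - z_j\|^2 = \lambda_j^{-1}\sum_{y\in A(j)} \mu_{y,x}\Bigl\|y - \lambda_j^{-1}\sum_{z\in A(j)} \mu_{z,x}\, z\Bigr\|^2 \ \le\ 4M^2\varepsilon^{2s/\gamma},$$

and for any set $F \subset \{1, \ldots, m\}$,

$$E\Bigl\|\sum_{j\in F} \lambda_j(Y_j - z_j)\Bigr\|^2 = \sum_{j\in F} \lambda_j^2 E\|Y_j - z_j\|^2 \ \le\ 4M^2\bigl(\max_{j\in F}\lambda_j\bigr)\varepsilon^{2s/\gamma}.$$

Next, an idea of B. Maurey will be applied. For $k = 1, 2, \ldots$, let
$Y_{j1}, Y_{j2}, \ldots, Y_{jk}$ be independent with the distribution of $Y_j$, and with $Y_{ji}$
also independent for different $j$. Then

$$E\Bigl\|\sum_{j\in F} \lambda_j k^{-1}\sum_{i=1}^k (Y_{ji} - z_j)\Bigr\|^2 \ \le\ 4M^2\bigl(\max_{j\in F}\lambda_j\bigr)\varepsilon^{2s/\gamma}/k,$$

so

$$E\Bigl\|\sum_{j\in F} \lambda_j k^{-1}\sum_{i=1}^k (Y_{ji} - z_j)\Bigr\| \ \le\ 2M\bigl(\max_{j\in F}\lambda_j\bigr)^{1/2}\varepsilon^{s/\gamma}/k^{1/2}.$$

Thus, there exist $y_{ji} \in A(j)$, $i = 1, \ldots, k$, $j \in F$, such that

$$\Bigl\|\sum_{j\in F} \lambda_j k^{-1}\sum_{i=1}^k (y_{ji} - z_j)\Bigr\| \ \le\ 2M\bigl(\max_{j\in F}\lambda_j\bigr)^{1/2}\varepsilon^{s/\gamma}/k^{1/2}. \tag{10.29}$$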
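
The key step of Maurey's argument is that the mean squared error of an average of $k$ independent draws equals the variance divided by $k$, so some particular $k$-tuple must do at least that well. The sketch below verifies this exactly for one block $A(j)$ by enumerating all $k$-tuples; the three points in $\mathbb{R}^2$, their weights, and the function names are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical block A(j): three points in R^2 with weights mu_y / lambda_j.
points = [(Fraction(0), Fraction(0)), (Fraction(1), Fraction(0)), (Fraction(0), Fraction(2))]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# z plays the role of z_j = E Y_j, the barycenter of the block.
z = tuple(sum(p * y[c] for p, y in zip(probs, points)) for c in range(2))
variance = sum(p * sq_dist(y, z) for p, y in zip(probs, points))   # E||Y_j - z_j||^2

def mean_and_best_error(k):
    """Exact E||k^{-1} sum_i Y_i - z||^2 and the smallest error over all k-tuples."""
    total, best = Fraction(0), None
    for idx in product(range(len(points)), repeat=k):
        weight = Fraction(1)
        for i in idx:
            weight *= probs[i]
        avg = tuple(sum(points[i][c] for i in idx) / k for c in range(2))
        err = sq_dist(avg, z)
        total += weight * err
        best = err if best is None or err < best else best
    return total, best

results = {k: mean_and_best_error(k) for k in range(1, 5)}
```

For each `k`, the first entry of `results[k]` equals `variance / k` and the second entry is no larger, which is the existence statement behind (10.29) in this toy case.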

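Take $v > 0$ such that $s + v < t$. Let $F(0) := \{j \le m :\ \lambda_j \ge \varepsilon^v\}$. Let $k(0)$ be
the smallest integer $k$ such that

$$k \ \ge\ 6400M^2\varepsilon^{2s/\gamma - 2} = 6400M^2\varepsilon^{-s}.$$

For $k \ge k(0)$ and $F = F(0)$ the expressions in (10.29) are at most $\varepsilon/40$.
Let $r$ be the smallest positive integer such that $\varepsilon^v/4^r \le \{\varepsilon^{1-s/\gamma}/(80M)\}^2$.
For $u = 1, 2, \ldots, r$, let

$$F(u) := \{j \le m :\ \varepsilon^v/4^u \le \lambda_j < \varepsilon^v/4^{u-1}\},$$

and let $k = k(u)$ be the smallest integer $k$ such that $2^{2-u}M\varepsilon^{s/\gamma}/k^{1/2} \le
\varepsilon/(40r)$, i.e., $k \ge 100M^2 4^{4-u}r^2\varepsilon^{-2+2s/\gamma}$. Thus, for some constant $C_5$, $k(u) \le
1 + C_5 4^{-u}\varepsilon^{-s}(L(1/\varepsilon))^2$. The $y_{ji}$ for $F = F(u)$ will be called $y_{ji}^{(u)}$ (they also
depend on $x$).

Let $F(r + 1) := \{j \le m :\ \lambda_j < \varepsilon^v/4^r\}$. Let $k(r + 1) := 1$. For $F = F(r +
1)$ and $k = k(r + 1)$, (10.29) is bounded above by $\varepsilon/40$. We have $y_{j1}^{(r+1)}$, a single
choice for each $j$. Let

$$\zeta := \sum_{u=0}^{r+1}\ \sum_{j\in F(u)} \lambda_j\, \frac{1}{k(u)}\sum_{i=1}^{k(u)} y_{ji}^{(u)}.$$

Then by (10.29) and the results for $u = 0$ and $u = r + 1$,

$$\begin{aligned}
\|\zeta - x\| &= \Bigl\|\sum_{u=0}^{r+1}\ \sum_{j\in F(u)} \lambda_j\, \frac{1}{k(u)}\sum_{i=1}^{k(u)} \bigl(y_{ji}^{(u)} - z_j\bigr)\Bigr\| \\
&\le \frac{\varepsilon}{40} + \frac{\varepsilon}{40} + \sum_{u=1}^{r} 2M(\varepsilon^v/4^{u-1})^{1/2}\varepsilon^{s/\gamma}/k(u)^{1/2} \\
&\le \frac{\varepsilon}{20} + r\cdot\frac{\varepsilon}{40r} = \frac{3\varepsilon}{40} < \frac{\varepsilon}{12}.
\end{aligned}$$

Here $\zeta$ is determined uniquely by the $k(u)$-tuples $\{y_{ji}^{(u)}\}_{i=1}^{k(u)}$, $j \in F(u)$, $u
= 0, 1, \ldots, r + 1$. Each $A(j)$ has at most $N$ elements, so that for given $u \le r$
and $j \le m$, there are at most $N^{k(u)}$ ways of choosing the $y_{ji}^{(u)}$. Now $\mathrm{card}(F(u)) \le
4^u/\varepsilon^v$, so the number of ways to choose the $y_{ji}^{(u)}$ for given $u$ with $1 \le u \le r$ is
at most $\exp\{(\log N)(4^u\varepsilon^{-v} + C_5\varepsilon^{-s-v}L(1/\varepsilon)^2)\}$. There are at most

$$\exp\bigl((\log N)[\varepsilon^{-s-v}6400M^2 + \varepsilon^{-v}]\bigr)$$

ways to choose the $y_{ji}^{(0)}$. Thus, the total number of ways to choose all the $y_{ji}^{(u)}$
gives

$$D(\varepsilon/6, F_\lambda) \ \le\ \exp\bigl(C_6\{\varepsilon^{-s-v}L(1/\varepsilon)^4 + L(1/\varepsilon)4^r/\varepsilon^v\}\bigr)$$

for some $C_6 := C_6(K)$. By definition of $r$, $4^r/\varepsilon^v \le C_7\varepsilon^{2s/\gamma - 2} = C_7\varepsilon^{-s}$ for
some $C_7 := C_7(K)$. Thus, $D(\varepsilon/6, F_\lambda) \le \exp(C_8\varepsilon^{-t})$ for some $C_8$. Combining
with (10.28) completes the proof of Theorem 10.19.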


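Example. The exponent $2\gamma/(2 + \gamma)$ in Theorem 10.19 is sharp in the following
example. Let $\{e_n\}_{n\ge 1}$ be an orthonormal basis of $H$, and for $0 < \gamma < \infty$ let

$$B := \{n^{-1/\gamma}e_n\}_{n\ge 1} \cup \{-n^{-1/\gamma}e_n\}_{n\ge 1}.$$

For any $\varepsilon > 0$ small enough we have $D(\varepsilon, B) = 2n$ for the least $n$ such that
$(n^{-2/\gamma} + (n+1)^{-2/\gamma})^{1/2} \le \varepsilon$: the points $\pm j^{-1/\gamma}e_j$, $1 \le j \le n$, are more
than $\varepsilon$ apart, so $D(\varepsilon, B) \ge 2n$, while a set of points of $B$ more than $\varepsilon$ apart
cannot contain $j^{-1/\gamma}e_j$ or $-j^{-1/\gamma}e_j$ for more than one value of $j \ge n$, so
$D(\varepsilon, B) \le 2n$. Thus as $\varepsilon \downarrow 0$, $\varepsilon^2 \sim 2n^{-2/\gamma}$, so for a constant $C = C_\gamma$, $2n \sim
C/\varepsilon^{\gamma}$ and replacing $C$ by a suitable larger $K$ if necessary, the hypothesis of
Theorem 10.19 holds.

Let $B_n$ be the intersection of $B$ with the linear span of $e_1, \ldots, e_n$. Let
$C_n := \mathrm{co}(B_n)$. Then for any $\varepsilon > 0$, $D(\varepsilon, \mathrm{co}(B)) \ge D(\varepsilon, C_n)$. The $n$-dimensional
volume $v_n(C_n)$ is

$$v_n(C_n) = \frac{2^n}{n!}\prod_{j=1}^n j^{-1/\gamma} = 2^n (n!)^{-1-1/\gamma}.$$

A ball $B(x, r) := \{y :\ |y - x| < r\}$ in $\mathbb{R}^n$ has volume $v_n(B(x, r)) = c_n r^n$
where $c_n = \pi^{n/2}/\Gamma((n + 2)/2)$. (This is well-known, especially for $n = 1, 2, 3$
since $\Gamma(1/2) = \pi^{1/2}$, and then can be proved by induction from $n$ to $n + 2$.) If
$x_1, \ldots, x_m$ are points of $C_n$ more than $\varepsilon$ apart, for a maximal $m = D(\varepsilon, C_n)$,
then the sets $B(x_i, \varepsilon)$ cover $C_n$, so $m c_n \varepsilon^n \ge v_n(C_n)$. By Stirling's formula
(Theorem 1.17), as $n \to \infty$

$$\begin{aligned}
v_n(C_n)/c_n &\sim (e n^{-1})^{n(1+1/\gamma)}(2\pi n)^{-(1+1/\gamma)/2}\, 2^n \pi^{-n/2}\,(n/(2e))^{n/2}(\pi n)^{1/2} \\
&= n^{-n(1/2 + 1/\gamma)}\, D_\gamma^{\,n}\; 2^{-(1+1/\gamma)/2}(\pi n)^{-1/(2\gamma)},
\end{aligned}$$

for a constant $D_\gamma$.

Take any $d$ such that $d > \frac{1}{2} + \frac{1}{\gamma}$. Then for $n$ large enough $v_n(C_n)/c_n \ge n^{-nd}$.
Let $g_n(\varepsilon) := n^{-nd}\varepsilon^{-n}$. Then $m \ge g_n(\varepsilon)$.

The following paragraph is only for motivation. As $\varepsilon \downarrow 0$, $n = n(\varepsilon)$ will be
chosen to make $g_n(\varepsilon)$ about as large as possible. We have

$$g_{n+1}(\varepsilon)/g_n(\varepsilon) = \frac{1}{\varepsilon}(n+1)^{-(n+1)d}\, n^{nd} = \frac{1}{\varepsilon}(n+1)^{-d}\Bigl(1 + \frac{1}{n}\Bigr)^{-nd} \sim \frac{1}{\varepsilon}\, n^{-d}e^{-d}.$$

This sequence is decreasing in $n$, so to maximize $g_n(\varepsilon)$ for a given $\varepsilon$ we want
to take $n$ such that the ratio is approximately 1.

At any rate, let $n(\varepsilon)$ be the largest integer $\le f(\varepsilon) := e^{-1}\varepsilon^{-1/d}$. Then for $\varepsilon$
small enough

$$g_{n(\varepsilon)}(\varepsilon) \ \ge\ f(\varepsilon)^{-d f(\varepsilon)}\varepsilon^{1 - f(\varepsilon)} = \varepsilon\exp(d f(\varepsilon)) = \varepsilon\exp\Bigl(\frac{d}{e}\Bigl(\frac{1}{\varepsilon}\Bigr)^{1/d}\Bigr).$$

Now $\varepsilon \ge \exp(-\varepsilon^{-\delta})$ as $\varepsilon \downarrow 0$ for any $\delta > 0$. Taking $\delta < 1/d$ and letting $d \downarrow (\gamma +
2)/(2\gamma)$ we have $1/d \uparrow (2\gamma)/(2 + \gamma)$, showing that the exponent in Theorem
10.19 is indeed best possible.

Recall the definitions of $D^{(2)}$ from Section 4.8
and $H_s$ from after Corollary 10.14.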


Corollary 10.20 If $\mathcal{G}$ is a uniformly bounded class of measurable functions and
for some $K < \infty$ and $0 < \gamma < \infty$, $D^{(2)}(\varepsilon, \mathcal{G}) \le K\varepsilon^{-\gamma}$ for $0 < \varepsilon \le 1$, then for
any $t > r := 2\gamma/(2 + \gamma)$, and for the constants $C_i = C_i(2K, \gamma, t)$, $i = 1, 2$,
of Theorem 10.19,

$$D^{(2)}(\varepsilon, H_s(\mathcal{G}, 1)) \ \le\ C_1\exp(C_2\varepsilon^{-t}) \quad \text{for } 0 < \varepsilon \le 1.$$


Proof. We have $D^{(2)}(\varepsilon, \mathcal{G} \cup -\mathcal{G}) \le 2K\varepsilon^{-\gamma}$, $0 < \varepsilon \le 1$. Thus for any law
$Q$ with finite support, $D^{(2)}(\varepsilon, \mathcal{G} \cup -\mathcal{G}, Q) \le 2K\varepsilon^{-\gamma}$. By Theorem 10.19,
$D^{(2)}(\varepsilon, H(\mathcal{G}, 1), Q) \le C_1\exp(C_2\varepsilon^{-t})$ for $C_i = C_i(2K, \gamma, t)$. It is easily seen
that $H(\mathcal{G}, 1)$ is a dense subset of $H_s(\mathcal{G}, 1)$ in $L^2(Q)$. The conclusion
follows.

Now, recall the notions of VC subgraph and VC subgraph hull class from
Section 4.7.


Corollary 10.21 If $\mathcal{G}$ is a uniformly bounded VC subgraph class and $M < \infty$,
then the VC subgraph hull class $H_s(\mathcal{G}, M)$ satisfies (10.18). Also, if $\mathcal{H}$ is a
uniformly bounded VC major class, then $\mathcal{H}$ satisfies (10.18).


Proof. First let $\mathcal{G}$ be VC subgraph. Let $M\mathcal{G} := \{Mg : g \in \mathcal{G}\}$. Then $M\mathcal{G}$ is
a uniformly bounded VC subgraph class. By Theorem 4.53(a), $M\mathcal{G}$ satisfies
the hypothesis of Corollary 10.20, so $r < 2$ and we can take $t < 2$, and the
conclusion holds.

If $\mathcal{H}$ is VC major for a VC class $\mathcal{C}$ of sets, let $\mathcal{F} := \{1_C : C \in \mathcal{C}\}$. Then $\mathcal{F}$ is
a VC subgraph class by Theorem 4.51(a) and $\mathcal{H}$ is VC hull, thus VC subgraph
hull by Theorem 4.51(b), so the first part of the proof applies with $\mathcal{F}$ in place
of $\mathcal{G}$.


Remark. It follows by Theorem 6.15 that for a uniformly bounded VC subgraph
class $\mathcal{G}$, if $\mathcal{F} \subset H_s(\mathcal{G}, M)$, in particular if $\mathcal{F}$ is VC major or (thus) VC hull,
and $\mathcal{F}$ is image admissible Suslin, then $\mathcal{F}$ is a universal Donsker class. This
also follows from Corollary 10.14 and Theorem 10.15.


Example. Let $\mathcal{C}$ be the set of all intervals $(a, b]$ for $0 \le a \le b \le 1$. Let $\mathcal{G}$
be the set of all real functions $f$ on $\mathbb{R}$ such that $|f(x)| \le 1/2$ for all $x$,
$|f(x) - f(y)| \le |x - y|$ for $0 \le x, y \le 1$, and $f(x) = 0$ for $x < 0$ or $x > 1$.
Each $f$ in $\mathcal{G}$ has total variation at most 2 (at most 1 on the open interval $0 <
x < 1$ and $1/2$ at each endpoint 0, 1). By the Jordan decomposition we have, for
each $f \in \mathcal{G}$, $f = g - h$ where $g$ and $h$ are both nondecreasing functions, 0 for
$x < 0$. Then $g$ and $h$ have equal total variations $\le 1$ and $\mathcal{G} \subset H_s(\mathcal{C}, 2)$ by the
proof of Theorem 4.51(b). Let $P$ be Lebesgue measure on $[0, 1]$. By Theorem
8.7, and since $(\int |f|^2\, dP)^{1/2} \ge \int |f|\, dP$, there is a $c > 0$ such that $D^{(2)}(\varepsilon, \mathcal{G})
\ge e^{c/\varepsilon}$ as $\varepsilon \downarrow 0$ (consider laws with finite support which approach $P$). Since
$S(\mathcal{C}) = 2$, the exponent $\gamma$ can be any number larger than 2 by Corollary 4.4 and
Theorem 4.47. Letting $\gamma \downarrow 2$, $t$ in Corollary 10.20 can be any number $> 1$, and
we saw above that it cannot be $< 1$ in this case, so again the exponent is sharp.


10.4 Uniform Donsker Classes 


A class $\mathcal{F}$ of measurable functions on a measurable space $(X, \mathcal{A})$ is a uniform
Donsker class if it is a universal Donsker class and the convergence in law of $\nu_n$
to $G_P$ is also uniform in $P$. Giné and Zinn (1991) gave a precise formulation
of the uniformity in terms of the dual-bounded-Lipschitz distance $\beta$ as defined
just before Theorem 3.28, and gave a characterization of the so-defined uniform
Donsker property of $\mathcal{F}$, to be stated and proved in this section.

Let $(X, \mathcal{A})$ be a measurable space. Recall that $\mathcal{P}(X) = \mathcal{P}(X, \mathcal{A})$ is the set
of all probability measures on $(X, \mathcal{A})$. Let $\mathcal{F}$ be a class of real-valued measurable
functions on $X$. By Proposition 10.10, if $\mathcal{F}$ is universal Donsker, then
$\mathcal{F}_0 := \{f - \inf f : f \in \mathcal{F}\}$ is uniformly bounded. By Proposition 10.11, $\mathcal{F}$ is
Donsker for a given $P$ if and only if $\mathcal{F}_0$ is, and the same holds for universal
Donsker and uniform Glivenko-Cantelli. Once we define uniform Donsker, we
will see that it also holds for uniform Donsker.

As in the definition of Donsker class (near the end of Section 3.1) let
$\ell^\infty(\mathcal{F})$ be the set of all bounded real-valued functions on $\mathcal{F}$. For any two
bounded functions $G$ and $H$ on $\mathcal{F}$, let $d(G, H) := d_{\mathcal{F}}(G, H) := \|G - H\|_{\mathcal{F}}$.
In Theorem 3.28 on equivalence of convergence in distribution and convergence
for the metric $\beta$ or $\rho$, for functions $f_m$ and $f_0$ from a probability space $\Omega$ into
$\ell^\infty(\mathcal{F})$, let the metric space $S$ be $\ell^\infty(\mathcal{F})$ and the metric $d = d_{\mathcal{F}}$. Then $\beta$ will
be written as $\beta_{\mathcal{F}}$. Recall that $\beta_{\mathcal{F}}$ is not a metric in the usual sense, in that
$\beta_{\mathcal{F}}(f_m, f_0)$ is defined only when $f_0$ has separable range and is measurable.

Let $\mathcal{P}_f(X)$ be the set of all laws $P$ in $\mathcal{P}(X)$ such that $P(F) = 1$ for some
finite $F$. In the (unusual) case that finite sets are not in the $\sigma$-algebra $\mathcal{A}$, we
can express this by saying that for some numbers $c_x \ge 0$ for all $x \in F$ with
$\sum_{x\in F} c_x = 1$, $P = \sum_{x\in F} c_x\delta_x$. Such a $P$ is defined on all subsets of $X$ and so
in particular on any $\sigma$-algebra.

For a pseudometric $d$ on a class $\mathcal{F}$, such as an $L^2$ distance $e_P$ or $\rho_P$, and
$\delta > 0$, let

$$\mathcal{F}'(\delta, d) := \{f - g :\ f, g \in \mathcal{F},\ d(f, g) < \delta\}. \tag{10.30}$$

Then for $d = e_P$ or $\rho_P$, $\mathcal{F}'(\delta, d)$ is image admissible Suslin by Corollary 5.18
and Theorem 5.26.

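Definitions. A class $\mathcal{F}$ is uniformly pregaussian (UPG) if it is pregaussian for
all $P \in \mathcal{P}(X)$, and if, for a coherent version of $G_P$ for each $P$, we have both

$$\sup_{P\in\mathcal{P}(X)} E\|G_P\|_{\mathcal{F}} < \infty \tag{10.31}$$

and

$$\lim_{\delta\downarrow 0}\ \sup_{P\in\mathcal{P}(X)} E\|G_P\|_{\mathcal{F}'(\delta,\rho_P)} = 0. \tag{10.32}$$

The class $\mathcal{F}$ is finitely uniformly pregaussian (UPG$_f$) if the same holds with
$\mathcal{P}_f(X)$ in place of $\mathcal{P}(X)$, namely

$$\sup_{P\in\mathcal{P}_f(X)} E\|G_P\|_{\mathcal{F}} < \infty \tag{10.33}$$

and

$$\lim_{\delta\downarrow 0}\ \sup_{P\in\mathcal{P}_f(X)} E\|G_P\|_{\mathcal{F}'(\delta,\rho_P)} = 0. \tag{10.34}$$

The class $\mathcal{F}$ is a uniform Donsker class if it is uniformly pregaussian and

$$\lim_{n\to\infty}\ \sup_{P\in\mathcal{P}(X)} \beta_{\mathcal{F}}(\nu_n, G_P) = 0 \tag{10.35}$$

where $\beta_{\mathcal{F}}$ is the dual-bounded-Lipschitz "metric" $\beta$ based on $\|\cdot\|_{\mathcal{F}}$ as in
Theorem 3.28.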
The following, due to Giné and Zinn (1991), will be proved: 


Theorem 10.22 Let $(X, \mathcal{A})$ be a measurable space and $\mathcal{F}$ an image-admissible
Suslin class of real-valued measurable functions on $X$. Then $\mathcal{F}$ is a uniform
Donsker class if and only if it is finitely uniformly pregaussian and thus, if and
only if it is uniformly pregaussian.


Remarks. The theorem is a very useful characterization since it is easier to 
check the finitely uniformly pregaussian property than to check the uniform 
Donsker property directly. 

Giné and Zinn showed that Pollard’s entropy condition (10.18), together 
with uniform boundedness and measurability for F, which imply F is univer- 
sal Donsker by Theorem 10.13, actually imply that F is uniformly Donsker 
(Theorem 10.26 below). Thus most of the examples of universal Donsker 
classes treated in Sections 10.2 and 10.3 are uniformly Donsker. An exception 
is the “ellipsoid” universal Donsker class of Proposition 10.17; see Problem 4 
below. 


Proof of Theorem 10.22. First, some uniformity in finite dimensional cases will
be helpful. Recall the space $BL(\mathbb{R}^d)$ of bounded Lipschitz functions on $\mathbb{R}^d$ and
its unit ball

$$BL_1(\mathbb{R}^d) := \{f \in BL(\mathbb{R}^d) :\ \|f\|_{BL} \le 1\}.$$

Let $\beta_d$ be the usual bounded Lipschitz distance for laws $P$, $Q$ on $\mathbb{R}^d$, namely,
$\beta_d(P, Q) := \sup\{|\int f\, d(P - Q)| :\ f \in BL_1(\mathbb{R}^d)\}$ (RAP, Prop. 11.3.2).

As in Section 8.2, for a multi-index $p = (p_1, \ldots, p_d)$ where $p_i$ are nonnegative
integers and $[p] := p_1 + \cdots + p_d$, let $x^p := \prod_{i=1}^d x_i^{p_i}$, and if $f$ is a
suitably differentiable function, let $D^p f := \partial^{[p]} f/\partial x_1^{p_1}\cdots\partial x_d^{p_d}$. Let $C_1^3$ be the
set of functions $f$ on $\mathbb{R}^d$ such that $D^p f$ exist and are bounded and continuous
for $0 \le [p] \le 3$, where $D^0 f := f$, and for which $\sum_{[p]\le 3} \|D^p f\|_{\sup} \le 1$. For
probability measures $P$ and $Q$ on $\mathbb{R}^d$ let

$$d_3(P, Q) := \sup\Bigl\{\Bigl|\int f\, d(P - Q)\Bigr| :\ f \in C_1^3\Bigr\}.$$

Let $S$ be a separable Banach space and $\mathcal{P}(S)$ the set of all Borel probability
measures on $S$. Recall that for any $P, Q \in \mathcal{P}(S)$, the convolution is defined
by $(P * Q)(A) := \int P(A - x)\, dQ(x)$ for any Borel set $A \subset S$. Convolution
is commutative and associative. The following fact and proof, given here for
completeness, are as in Araujo and Giné (1980, p. 37); on p. 67, and in Giné
and Zinn (1991), Lindeberg's name is associated with it.


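Lemma 10.23 Let $S$ be a separable Banach space and $\mathcal{F}$ a uniformly bounded,
image admissible Suslin family of real-valued Borel functions on $S$ which
is translation invariant, i.e., for each $f \in \mathcal{F}$ and $u \in S$, the function $x \mapsto
f(x - u)$ is in $\mathcal{F}$. Let $P_j$ and $Q_j$ be in $\mathcal{P}(S)$ for $j = 1, \ldots, n$. Then

$$\|(P_1 * P_2 * \cdots * P_n) - (Q_1 * Q_2 * \cdots * Q_n)\|_{\mathcal{F}} \ \le\ \sum_{j=1}^n \|P_j - Q_j\|_{\mathcal{F}}.$$


Proof. It will suffice to treat $n = 2$, as then one can use induction. We have

$$\begin{aligned}
\Bigl\|\int f\, d(P_1 * P_2 - Q_1 * Q_2)\Bigr\|_{\mathcal{F}}
&= \Bigl\|\iint f(x + y)\,[dP_1(x)\,dP_2(y) - dQ_1(x)\,dQ_2(y)]\Bigr\|_{\mathcal{F}} \\
&\le \Bigl\|\iint f(x + y)\, d(P_1 - Q_1)(x)\, dP_2(y)\Bigr\|_{\mathcal{F}}
 + \Bigl\|\iint f(x + y)\, d(P_2 - Q_2)(y)\, dQ_1(x)\Bigr\|_{\mathcal{F}} \\
&\le \int \Bigl\|\int f(x + y)\, d(P_1 - Q_1)(x)\Bigr\|_{\mathcal{F}} dP_2(y)
 + \int \Bigl\|\int f(x + y)\, d(P_2 - Q_2)(y)\Bigr\|_{\mathcal{F}} dQ_1(x) \\
&= \|P_1 - Q_1\|_{\mathcal{F}} + \|P_2 - Q_2\|_{\mathcal{F}},
\end{aligned}$$

proving the Lemma.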
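
A simple instance of the lemma can be checked by hand or by machine. The sketch below takes $S = \mathbb{R}$, laws supported on a few integers, and the translation-invariant class of half-line indicators $1_{(-\infty, t]}$, for which $\|P - Q\|_{\mathcal{F}}$ is the Kolmogorov distance. The specific laws and the helper names `convolve` and `ks_norm` are assumptions made for illustration.

```python
from fractions import Fraction

def convolve(p, q):
    """Convolution of two laws on the integers, each given as a {point: mass} dict."""
    out = {}
    for x, a in p.items():
        for y, b in q.items():
            out[x + y] = out.get(x + y, Fraction(0)) + a * b
    return out

def ks_norm(p, q):
    """sup_t |P(-inf, t] - Q(-inf, t]|, i.e. ||P - Q||_F for half-line indicators."""
    diff, worst = Fraction(0), Fraction(0)
    for t in sorted(set(p) | set(q)):
        diff += p.get(t, Fraction(0)) - q.get(t, Fraction(0))
        worst = max(worst, abs(diff))
    return worst

# Hypothetical laws P1, P2, Q1, Q2 on {0, 1, 2}.
P1 = {0: Fraction(1, 2), 1: Fraction(1, 2)}
Q1 = {0: Fraction(1, 3), 1: Fraction(2, 3)}
P2 = {0: Fraction(1, 4), 2: Fraction(3, 4)}
Q2 = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}

lhs = ks_norm(convolve(P1, P2), convolve(Q1, Q2))
rhs = ks_norm(P1, Q1) + ks_norm(P2, Q2)
print(lhs, rhs)
```

Because the laws are purely atomic on the integers, the supremum over real $t$ is attained at a support point, so scanning the sorted support is enough.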


For $S = \mathbb{R}^d$, $BL_1(\mathbb{R}^d)$ and $C_1^3$ satisfy the hypotheses on $\mathcal{F}$ in Lemma 10.23.


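Lemma 10.24 For $0 < M < \infty$ and $d = 1, 2, \ldots$, let $\mathcal{P}_M^d$ be the class of Borel
probability measures $P$ on $\mathbb{R}^d$ with $P(|x| \le M) = 1$. For $P \in \mathcal{P}_M^d$, let $\{X_i^P\}_{i\ge 1}$
be i.i.d. ($P$) and let $C_P$ be the covariance for $P$. Then we have for a constant
$K = K_d$ depending only on the dimension,

$$d_3\Bigl(\mathcal{L}\Bigl(\sum_{i=1}^n (X_i^P - EX_i^P)/\sqrt{n}\Bigr),\ N(0, C_P)\Bigr) \ \le\ KM\,\mathrm{Tr}(C_P)/\sqrt{n} \tag{10.36}$$

and

$$\lim_{n\to\infty}\ \sup_{P\in\mathcal{P}_M^d} \beta_d\Bigl(\mathcal{L}\Bigl(\frac{1}{\sqrt{n}}\sum_{i=1}^n (X_i^P - EX_i^P)\Bigr),\ N(0, C_P)\Bigr) = 0. \tag{10.37}$$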
Proof. For (10.36), first, it is well known that for any covariance matrices C and 
D, N(0, C)» N(0, D) = N(0, C + D), and this can be iterated to any number 


of terms. Let Y; be i.i.d. N(O, Cp/n), &i := (XP — EX yin, i=1,...,n, 
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and é := {&}'_,. Thus by Lemma 10.23, 
d3 (< (È e) , N(O, cn) < nd; (L(§1), N(0, Cp/n)). (10.38) 
i=l 
If f € CÌ, then for p! := []4_, p;! 


D? f(O)E? 
Fé) =06)+ Y a (10.39) 


[p]<2 


where the remainder 0(&;) < Jal&i 3 for a constant J4 depending only on the 
dimension d, and similarly for f (Y;). Here Eż; = EY; = 0, EE? = EY} when- 
ever [p] = 2, and 


E\é>) < 2ME|€|?//n = 2M Tr(Cp)/n?”. (10.40) 


Take coordinates in which Cp is diagonalized. For each coordinate yj := Yj; 
of Yı we have E(y;) = o? := EEP), and 


d 
Tr(Cp) =n X0? = nE(E P) < EXP) < M°. 


j=l 
We have 
d 
EYI = EO? +--+ yy) =D EG)+2 Yo) ofo. 
j=l I<i<j<d 

For each j, E(y;) = 307 < 3M?o7/n. It follows that 

E\Y\|* < 3M? Tr(Cp)/n? + (Tr(Cp))”/n? < 4M? Tr(Cp)/n?. 
Then by the Cauchy—Bunyakovsky—Schwarz inequality 

EMP = EK) 1Yil) < VEY lt Te(Cp/n) < 2M Tr(Cp)/n*. (10.41) 


We then have, summing terms in the Taylor series (10.39) for $\xi_1$ and the corresponding ones for $Y_1$, since $E(\xi_1^p) = E(Y_1^p)$ for $[p] \le 2$, that by (10.40) and (10.41),

\[
\big|E[f(\xi_1) - f(Y_1)]\big| \le 4M J_d\,\mathrm{Tr}(C_P)/n^{3/2},
\]

which by (10.38) gives (10.36) with $K_d := 4J_d$.
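The two Gaussian moment bounds used above can be checked numerically for a given diagonal covariance. The following is a minimal sketch, not part of the original argument; the function name and the sample variances are hypothetical, and the only formulas used are the ones displayed in the proof.

```python
from math import sqrt

def gaussian_moment_bounds(sigma2, M, n):
    """Check the moment bounds leading to (10.41) for Y_1 ~ N(0, C_P/n).

    sigma2[j] is the variance of the j-th coordinate of Y_1 in coordinates
    diagonalizing C_P, so Tr(C_P) = n * sum(sigma2), which must be <= M**2.
    """
    trace_cp = n * sum(sigma2)
    if trace_cp > M ** 2:
        raise ValueError("need Tr(C_P) <= M^2")
    d = len(sigma2)
    cross = sum(sigma2[i] * sigma2[j] for i in range(d) for j in range(i + 1, d))
    fourth = 3 * sum(s * s for s in sigma2) + 2 * cross   # E|Y_1|^4, exact formula
    fourth_bound = 4 * M ** 2 * trace_cp / n ** 2          # bound stated in the proof
    third = sqrt(fourth * trace_cp / n)                    # Cauchy-Schwarz bound on E|Y_1|^3
    third_bound = 2 * M * trace_cp / n ** 1.5              # right side of (10.41)
    return fourth, fourth_bound, third, third_bound

# Hypothetical example: d = 3, M = 1, n = 10.
fourth, fourth_bound, third, third_bound = gaussian_moment_bounds(
    [0.03, 0.05, 0.01], M=1.0, n=10
)
print(fourth, fourth_bound, third, third_bound)
```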

Next, to bound the $\beta_d$ distance in (10.37), let $f \in BL_1(\mathbb{R}^d)$. Smooth $f$ by convolving it with the $N(0, \varepsilon^2 I)$ density $\varphi_\varepsilon$, namely, for $\varphi := \varphi_1$, set $f_\varepsilon(x) := \int f(x - \varepsilon y)\varphi(y)\,dy$. Then for some constant $c(d)$

\[
\|f - f_\varepsilon\|_{\sup} \le (2\pi)^{-d/2} \int \min(2, \varepsilon|y|) \exp(-|y|^2/2)\,dy \le c(d)\,\varepsilon \to 0
\]

as $\varepsilon \downarrow 0$. By the change of variables $u = x - \varepsilon y$, so that $y = (x - u)/\varepsilon$, we get $f_\varepsilon(x) = \varepsilon^{-d} \int f(u)\varphi((x - u)/\varepsilon)\,du$. We can then differentiate under the integral sign (Appendix A) with respect to coordinates of $x$. For any multi-index $p$, $\|D^p f_\varepsilon\|_{\sup} \le \varepsilon^{-[p]} \int |D^p \varphi(y)|\,dy$. Thus, $f$ can be approximated as well as desired in sup norm by a function having bounded, continuous derivatives up to any given degree, in our case $[p] \le 3$. Thus (10.37) and Lemma 10.24 follow.


Next, here are some definitions. For any nnds (nonnegative definite symmetric) matrix $M$, there is a unique nnds square root $A = M^{1/2}$, namely, a nnds matrix $A$ with $A^2 = M$. For any $d \times d$ matrix $M$, define the Hilbert-Schmidt norm $\|M\|_2$ as the square root of, for any two orthonormal bases $\{e_j\}_{j=1}^{d}$ and $\{f_i\}_{i=1}^{d}$,

\[
\|M\|_2^2 := \sum_{j=1}^{d} |Me_j|^2 = \sum_{j=1}^{d} \sum_{i=1}^{d} \langle Me_j, f_i \rangle^2.
\]

This does not depend on the choice of orthonormal basis $\{e_j\}$ since $\langle Me_j, f_i \rangle = \langle M^{T} f_i, e_j \rangle$. For $(S, d)$ a separable metric space, $1 \le p < \infty$, and two Borel probability measures $P$ and $Q$ on $S$, such that $\int d(x_0, x)^p\,d\mu(x) < \infty$ for $\mu = P$ and $\mu = Q$ and some (any) $x_0 \in S$, the Wasserstein ($p$) distance is here defined by $W_p(P, Q) := \inf\{E\,d(X, Y)^p : \mathcal{L}(X) = P,\ \mathcal{L}(Y) = Q\}^{1/p}$. (Sometimes in the literature "Wasserstein distance" refers to the special case $p = 1$, and other times to quantities $\rho(x, y)$ other than $p$th powers of a given metric.) The Fréchet distance between $P$ and $Q$ is $W_2(P, Q)$. We have $W_p(P, Q) \le W_q(P, Q)$ for $1 \le p \le q < \infty$ by the Hölder inequality. Recall $\|f\|_L := \sup_{x \ne y} |f(x) - f(y)|/d(x, y)$. For $p = 1$, $W_1(P, Q) = \sup\{|\int f\,d(P - Q)| : \|f\|_L \le 1\}$ by the Kantorovich-Rubinštein theorem (RAP, Theorem 11.8.2). Clearly, the bounded Lipschitz distance satisfies $\beta(P, Q) \le W_1(P, Q)$. We next have:


Lemma 10.25 Let $C$ and $D$ be two $d \times d$ covariance matrices and $N(0, C)$ and $N(0, D)$ the corresponding normal laws on $\mathbb{R}^d$. Then

\[
\beta_d(N(0, C), N(0, D)) \le W_1(N(0, C), N(0, D)) \le W_2(N(0, C), N(0, D)) \le \|\sqrt{C} - \sqrt{D}\|_2. \tag{10.42}
\]

Proof. Let $Z$ have $N(0, I_d)$ distribution on $\mathbb{R}^d$. Then $\sqrt{C}Z$ has $N(0, C)$ and $\sqrt{D}Z$ has $N(0, D)$. We have

\[
W_2(N(0, C), N(0, D)) \le \big[E\big(|(\sqrt{C} - \sqrt{D})Z|^2\big)\big]^{1/2} = \|\sqrt{C} - \sqrt{D}\|_2.
\]

The rest follows from the discussion preceding the Lemma.
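Two facts used here lend themselves to a quick numerical check: the Hilbert-Schmidt norm does not depend on the orthonormal bases, and for diagonal covariances the bound in (10.42) is explicit. The sketch below is illustrative only; the matrix, the rotation angle, and the diagonal entries are hypothetical choices, and plain Python lists stand in for matrices.

```python
from math import cos, sin, sqrt, isclose

def hs_norm_sq(A, es, fs):
    """Sum over j, i of <A e_j, f_i>^2 for orthonormal bases es and fs."""
    def apply(v):
        return [sum(row[k] * v[k] for k in range(len(v))) for row in A]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(dot(apply(e), f) ** 2 for e in es for f in fs)

standard = [[1.0, 0.0], [0.0, 1.0]]
t = 0.7  # arbitrary rotation angle
rotated = [[cos(t), sin(t)], [-sin(t), cos(t)]]

A = [[1.0, 2.0], [0.5, -1.0]]                     # a non-symmetric 2 x 2 matrix
hs_standard = hs_norm_sq(A, standard, standard)   # sum of squared entries
hs_rotated = hs_norm_sq(A, rotated, standard)     # change only the basis {e_j}
hs_both = hs_norm_sq(A, rotated, rotated)         # change both bases

# Diagonal covariances C = diag(4, 1), D = diag(1, 9): here sqrt(C) - sqrt(D)
# is diag(1, -2), so the right side of (10.42) is sqrt(1 + 4).
C_diag = [4.0, 1.0]
D_diag = [1.0, 9.0]
bound = sqrt(sum((sqrt(c) - sqrt(d)) ** 2 for c, d in zip(C_diag, D_diag)))
print(hs_standard, hs_rotated, hs_both, bound)
```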
Now, to prove Theorem 10.22, since clearly uniform Donsker implies $UPG_f$ (finitely uniformly pregaussian), it will suffice to prove the converse, which will be done in a series of claims. Let $\mathcal{F} \in UPG_f$ with $\mathcal{F}$ admissible Suslin.


Claim 1. It suffices to consider classes F uniformly bounded by 1. 


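Proof of Claim 1. For $x \ne y$ in $X$ let $P(x, y) := \frac{1}{2}(\delta_x + \delta_y)$. Then

\[
\sup_{x, y \in X} E\|G_{P(x,y)}\|_{\mathcal{F}} < \infty
\]

implies $\sup_{f \in \mathcal{F}} (\mathrm{diam}(f))^2/4 = \sup_{f \in \mathcal{F}} \sup_{x, y} E\,G_{P(x,y)}(f)^2 < \infty$. Thus the class $\mathcal{F}_0 := \{f - \inf f : f \in \mathcal{F}\}$ is uniformly bounded as in Proposition 10.3, so we can replace $\mathcal{F}$ by $\mathcal{F}_0$. For $M := \sup\{\|g\|_{\sup} : g \in \mathcal{F}_0\}$ we can further replace $\mathcal{F}$ by $\{f/M : f \in \mathcal{F}_0\}$ without loss of generality, getting a class bounded by 1, proving Claim 1.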


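Let $\mathcal{F}^2 := \{fg : f, g \in \mathcal{F}\}$, $\mathcal{F}' := \{f - g : f, g \in \mathcal{F}\}$, and $(\mathcal{F}')^2 := \{h^2 : h \in \mathcal{F}'\}$.


Claim 2. Let $\mathcal{G} := \mathcal{F} \cup \mathcal{F}^2 \cup \mathcal{F}' \cup (\mathcal{F}')^2$. Then $\sup_{P \in \mathcal{P}(X)} E_P\|P_n - P\|_{\mathcal{G}} = O(n^{-1/2})$.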


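Proof of Claim 2. The class $\mathcal{G}$ and each of the given subclasses of it are image admissible Suslin by Theorem 5.26. It suffices to consider $\mathcal{G} = (\mathcal{F}')^2$. For $f, g, \varphi, \psi \in \mathcal{F}$, by Claim 1, we have

\[
\big((f - g)^2 - (\varphi - \psi)^2\big)^2 = \big(f - g - (\varphi - \psi)\big)^2 \big(f - g + \varphi - \psi\big)^2 \le 16\big(f - g - (\varphi - \psi)\big)^2,
\]

and so

\[
E_{P_n}\big[\big((f - g)^2 - (\varphi - \psi)^2\big)^2\big] \le 16\,E_{P_n}\big[\big((f - g) - (\varphi - \psi)\big)^2\big]. \tag{10.43}
\]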
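The pointwise inequality behind (10.43) only uses that $f - g$ and $\varphi - \psi$ take values in $[-2, 2]$ when the class is bounded by 1. As a sanity check (a sketch on a hypothetical grid of values, not a proof), one can evaluate both sides exactly:

```python
from fractions import Fraction
from itertools import product

def squared_gap(a, b):
    """((f-g)^2 - (phi-psi)^2)^2 at a point where f-g = a and phi-psi = b."""
    return (a * a - b * b) ** 2

def bound(a, b):
    """16 ((f-g) - (phi-psi))^2 at the same point."""
    return 16 * (a - b) ** 2

# Values of f - g and phi - psi lie in [-2, 2] when all functions are bounded by 1.
grid = [Fraction(k, 10) for k in range(-20, 21)]
worst_excess = max(squared_gap(a, b) - bound(a, b) for a, b in product(grid, grid))
print(worst_excess)  # never positive on this grid
```

Equality occurs on the diagonal $a = b$, where both sides vanish.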


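Take a product of our probability space $(X^\infty, \mathcal{A}^\infty, P^\infty)$ with a copy of $[0, 1]$ with its usual $\sigma$-algebra and with $U[0, 1]$ distribution. On $[0, 1]$ define a sequence $\{g_i\}_{i \ge 1}$ of i.i.d. $N(0, 1)$ random variables. Define a process for $h \in \mathcal{G}$ by $G(h) := \sum_{i=1}^{n} g_i h(X_i)/\sqrt{n}$. For fixed $\omega \in X^\infty$ and thus $X_1, \dots, X_n$, and where $E_g$ denotes expectation with respect to the distribution of the $g_i$ only, it will be shown that

\[
E_g \Big\|\sum_{i=1}^{n} g_i \delta_{X_i}/\sqrt{n}\Big\|_{(\mathcal{F}')^2} \le 8\,E_g \Big\|\sum_{i=1}^{n} g_i \delta_{X_i}/\sqrt{n}\Big\|_{\mathcal{F}'}. \tag{10.44}
\]
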



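Let $T := \mathcal{F}'$. For $t := h = f - g$ with $f, g \in \mathcal{F}$, let $Y_t := G(h)$ and $s = H = \varphi - \psi$, $\varphi, \psi \in \mathcal{F}$. Then

\[
E\big((Y_t - Y_s)^2\big) = E\Big[\Big(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} g_i (h - H)(X_i)\Big)^2\Big] = \frac{1}{n} \sum_{i=1}^{n} (h - H)^2(X_i) = P_n\big((h - H)^2\big).
\]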
i=l 


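Let $X_t := X_h := \frac{1}{4} G(h^2)$. Then we have

\[
16\,E\big((X_t - X_s)^2\big) = E\Big[\Big(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} g_i (h^2 - H^2)(X_i)\Big)^2\Big] = \frac{1}{n} \sum_{i=1}^{n} (h^2 - H^2)^2(X_i) = P_n\big((h^2 - H^2)^2\big) \le 16\,P_n\big((h - H)^2\big)
\]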


by (10.43). So $E((X_t - X_s)^2) \le E((Y_t - Y_s)^2)$, a main hypothesis of Theorem 2.18. Taking a countable dense subset $\{f_i\}_{i=1}^{\infty}$ in $\mathcal{F}$ with respect to $L^2(P_n)$, $S := \{f_i - f_j\}_{i,j \ge 1}$ is countable and dense in $T = \mathcal{F}'$, and $\inf_{i,j} |X_{f_i - f_j}| = 0$ a.s., so by (2.21), which Giné and Zinn call the Slepian-Fernique Lemma, we do get (10.44).

Now suppose, as one may, that beside $\{g_i\}_{i \ge 1}$, on $[0, 1]$ we also have defined i.i.d. Rademacher variables $\{\epsilon_i\}_{i \ge 1}$ independent of $\{g_i\}_{i \ge 1}$. For any nonnegative random variable $Y$ we have $EY = \int_0^\infty \Pr(Y > x)\,dx$. Thus (the next inequality is well known to hold with a factor of 2 rather than 4, but 4 suffices for present purposes) by desymmetrization (10.6)

\[
E\|P_n - P\|_{(\mathcal{F}')^2} \le 4\,E\Big\|\frac{1}{n} \sum_{i=1}^{n} \epsilon_i \delta_{X_i}\Big\|_{(\mathcal{F}')^2}. \tag{10.45}
\]
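Next, $\{\epsilon_i |g_i|\}_{i=1}^{n}$ are equal in distribution to $\{g_i\}_{i=1}^{n}$, and so we have

\[
E\Big\|\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{(\mathcal{F}')^2} = E\Big\|\sum_{i=1}^{n} \epsilon_i |g_i| \delta_{X_i}\Big\|_{(\mathcal{F}')^2} \ge E|g_1|\ E\Big\|\sum_{i=1}^{n} \epsilon_i \delta_{X_i}\Big\|_{(\mathcal{F}')^2}
\]

by Jensen's inequality, for example Lemma 9.7(a), applied to integration $E_g$ with respect to $\{g_i\}$ only for fixed $X_i$ and $\epsilon_i$. We have joint measurability by the image admissible Suslin property of $(\mathcal{F}')^2$. It follows that the right side of (10.45) is

\[
\le \frac{4}{n\,E|g_1|}\ E\Big\|\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{(\mathcal{F}')^2},
\]

which then by (10.44), then the triangle inequality, is

\[
\le \frac{32}{\sqrt{n}\,E|g_1|}\ E_P E_g \Big\|\frac{1}{\sqrt{n}} \sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{F}'}
\le \frac{64}{\sqrt{n}\,E|g_1|}\ E_P E_g \Big\|\frac{1}{\sqrt{n}} \sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{F}}. \tag{10.46}
\]

We have $E|g_1| = \sqrt{2/\pi}$, so that $1/E|g_1| = \sqrt{\pi/2}$, and the Gaussian process $W_{P_n}$ can be written as

\[
W_{P_n} = n^{-1/2} \sum_{i=1}^{n} g_i \delta_{X_i}, \tag{10.47}
\]

and so the expression in (10.46) is

\[
\le 64\sqrt{\pi/2}\ n^{-1/2} \sup_{Q \in \mathcal{P}_f(X)} E\|W_Q\|_{\mathcal{F}},
\]

which by Claim 1 and the fact that we can write

\[
W_Q(f) = G_Q(f) + Z\,E_Q f \tag{10.48}
\]

for a $N(0, 1)$ variable $Z$ independent of $G_Q$, recalling that $W_P$ is the isonormal process on $L^2(P)$, is

\[
\le 64\sqrt{\pi/2}\ n^{-1/2} \sup_{Q \in \mathcal{P}_f(X)} \big(E\|G_Q\|_{\mathcal{F}} + \sqrt{2/\pi}\big).
\]

Combining, since $\mathcal{F} \in UPG_f$, the left side of (10.45) is $O(n^{-1/2})$ uniformly in $P \in \mathcal{P}(X)$, proving Claim 2.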
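The constant $\sqrt{\pi/2}$ in the last displays is the reciprocal of $E|g_1|$ for a standard normal $g_1$. A minimal numerical check, assuming nothing beyond the standard normal density (the quadrature rule, step count, and cutoff are arbitrary choices made for this sketch):

```python
from math import exp, pi, sqrt

def abs_normal_mean(steps=20_000, cutoff=10.0):
    """Midpoint rule for E|g| = 2 * integral_0^cutoff x * phi(x) dx, g ~ N(0, 1)."""
    h = cutoff / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += x * exp(-x * x / 2)
    return 2 * total * h / sqrt(2 * pi)

mean_abs_g = abs_normal_mean()
constant = 64 / mean_abs_g   # numerical value of 64 * sqrt(pi/2)
print(mean_abs_g, sqrt(2 / pi), constant)
```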

Recall $\mathcal{F}'(\delta, d)$ as defined in (10.30). The next claim includes a uniform asymptotic equicontinuity condition:


Claim 3. $(\mathcal{F}, e_P)$ is totally bounded for each $P \in \mathcal{P}(X)$ and for all $\varepsilon > 0$,

\[
\lim_{\delta \downarrow 0}\ \limsup_{n \to \infty}\ \sup_{P \in \mathcal{P}(X)} \Pr\big(\|\sqrt{n}(P_n - P)\|_{\mathcal{F}'(\delta, e_P)} > \varepsilon\big) = 0. \tag{10.49}
\]

Also, $\mathcal{F}$ is universal Donsker.

Proof of Claim 3. Once (10.49) is proved, it will follow by Theorem 3.34 that $\mathcal{F}$ is universal Donsker. Since $\mathcal{F} \in UPG_f$, it follows that

\[
\sup_{n, P_n} E\|G_{P_n}\|_{\mathcal{F}} < \infty
\]

where the supremum is over all possible empirical measures $P_n := \frac{1}{n} \sum_{j=1}^{n} \delta_{x_j}$ for $x_1, \dots, x_n$ in $X$. It follows that for each $\varepsilon > 0$, the supremum over $P_n$ of the packing numbers $D(\varepsilon, \mathcal{F}, \rho_{P_n})$ is bounded by the Sudakov minoration Theorem 2.22(b), specifically, there is a finite $c$ such that for all possible $P_n$ on $X$ and $\varepsilon > 0$,

\[
\log D(\varepsilon, \mathcal{F}, e_{P_n}) \le c/\varepsilon^2, \tag{10.50}
\]

first for $\rho_{P_n}$ in place of $e_{P_n}$, and then for $e_{P_n}$ with a possibly larger $c$ since $\mathcal{F}$ is uniformly bounded by 1.

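Next, for each $P \in \mathcal{P}(X)$, $Y_n := \|P_n - P\|_{(\mathcal{F}')^2}$ is a reversed submartingale by Theorem 6.6. Since $(\mathcal{F}')^2$ is uniformly bounded, $Y_n$ converges a.s. and in $L^1$ to some limit by theorems of Doob (RAP, Theorem 10.6.4). By Claim 2, the limit in $L^1$, and so a.s., must be 0,

\[
\lim_{n \to \infty} \|P_n - P\|_{(\mathcal{F}')^2} = 0. \tag{10.51}
\]

We have

\[
\big|e_{P_n}(f, g)^2 - e_P(f, g)^2\big| = \big|P_n((f - g)^2) - P((f - g)^2)\big|,
\]

and so almost surely, for any $P \in \mathcal{P}(X)$,

\[
\lim_{n \to \infty}\ \sup_{f, g \in \mathcal{F}} \big|e_{P_n}(f, g)^2 - e_P(f, g)^2\big| = 0. \tag{10.52}
\]

From this and (10.50), for all $\varepsilon > 0$,

\[
\sup_{P \in \mathcal{P}(X)} \log D(\varepsilon, \mathcal{F}, e_P) \le c/\varepsilon^2. \tag{10.53}
\]

Thus $(\mathcal{F}, e_P)$ is totally bounded, uniformly in $P$.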

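Now let, as previously, $\{\epsilon_i\}_{i \ge 1}$ be a sequence of i.i.d. Rademacher variables defined on the probability space $([0, 1], U[0, 1])$, where we take the product space $X^\infty \times [0, 1]$. For a given $\varepsilon > 0$ and $0 < \delta < \varepsilon$, Lemma 9.6(b) will be applied to $\mathcal{F}'(\delta, e_P)$ in place of $\mathcal{F}$. Then we will have $\alpha \le \delta$. Thus we can take any $t \ge \sqrt{2n}\,\delta$, specifically, $t = 2\sqrt{n}\,\varepsilon$. Dividing all four expressions in the events whose probabilities are taken in Lemma 9.6(b) by $\sqrt{n}$ we get equivalent events. Let $u = t/\sqrt{n} = 2\varepsilon$. Then we get

\[
\Pr\big(\|\sqrt{n}(P_n - P)\|_{\mathcal{F}'(\delta, e_P)} > 2\varepsilon\big) \le 4 \Pr\Big(\Big\|\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i \delta_{X_i}\Big\|_{\mathcal{F}'(\delta, e_P)} > \varepsilon\Big).
\]

Next, we have

\[
\Pr\Big(\Big\|\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i \delta_{X_i}\Big\|_{\mathcal{F}'(\delta, e_P)} > \varepsilon\Big) \le T_1 + T_2
\]

where

\[
T_1 := T_1(\delta, \varepsilon) := \Pr\Big(\Big\|\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i \delta_{X_i}\Big\|_{\mathcal{F}'(\sqrt{2}\delta, e_{P_n})} > \varepsilon\Big)
\]

and

\[
T_2 := T_2(\delta) := \Pr\Big(\sup_{f, g \in \mathcal{F}} \big|e_{P_n}(f, g)^2 - e_P(f, g)^2\big| > \delta^2\Big).
\]

Claim 2 implies that $\lim_{n \to \infty} \sup_{P \in \mathcal{P}(X)} T_2(\delta) = 0$ for all $\delta > 0$.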
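By (10.48) we have for any probability measure $Q$ on $\mathcal{A}$ and $\delta > 0$,

\[
E\|W_Q\|_{\mathcal{F}'(\delta, e_Q)} \le E\|G_Q\|_{\mathcal{F}'(\delta, e_Q)} + \delta.
\]

Then Gaussianizing and using Jensen's inequality as in the proof of Claim 2,

\[
\begin{aligned}
E\Big\|\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i \delta_{X_i}\Big\|_{\mathcal{F}'(\sqrt{2}\delta, e_{P_n})}
&\le \frac{1}{E|g_1|}\ E_P E_g \Big\|\frac{1}{\sqrt{n}} \sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{F}'(\sqrt{2}\delta, e_{P_n})} \\
&\le \frac{1}{E|g_1|}\ \sup_{Q \in \mathcal{P}_f(X)} E\|W_Q\|_{\mathcal{F}'(\sqrt{2}\delta, e_Q)} \\
&\le \sqrt{\pi/2}\ \Big[\sup_{Q \in \mathcal{P}_f(X)} E\|G_Q\|_{\mathcal{F}'(\sqrt{2}\delta, e_Q)} + \sqrt{2}\,\delta\Big]
\end{aligned}
\]

for all $P \in \mathcal{P}(X)$. Since $\mathcal{F} \in UPG_f$ it follows that for all $\varepsilon > 0$,

\[
\lim_{\delta \downarrow 0}\ \sup_{P \in \mathcal{P}(X)} T_1(\delta, \varepsilon) = 0,
\]

which proves (10.49) and so Claim 3.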
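Claim 4. $\mathcal{F} \in UPG$.


Proof of Claim 4. For $\mathcal{G}$ as in Claim 2, we have by the same proof as for $(\mathcal{F}')^2$ (10.51) that for any $P \in \mathcal{P}(X)$, almost surely

\[
\|P_n - P\|_{\mathcal{G}} \to 0. \tag{10.54}
\]

Take an $\omega \in X^\infty$ such that (10.54) holds. Then for any finite sequence $f_1, \dots, f_r$ of functions in $\mathcal{F}$, we have convergence in distribution

\[
\mathcal{L}\big(G_{P_n(\omega)}(f_1), \dots, G_{P_n(\omega)}(f_r)\big) \to \mathcal{L}\big(G_P(f_1), \dots, G_P(f_r)\big). \tag{10.55}
\]

We know that $\mathcal{F}$ is $P$-Donsker by Claim 3 and thus that it is pregaussian for $P$, so that $G_P$ can be chosen to have a distribution with separable support in $\ell^\infty(\mathcal{F})$. The same is true of each $G_{P_n(\omega)}$ for our fixed $\omega$. The next aim is to show that $\{G_{P_n(\omega)}\}_{n \ge 1}$ is a Cauchy sequence with respect to the distance $\beta$. Let $H \in BL_1^{\mathcal{F}} := \{H \in BL(\ell^\infty(\mathcal{F})) : \|H\|_{BL} \le 1\}$. Given $\tau > 0$, since $\rho_P \le e_P$, by (10.52) and (10.53), there are an $N < \infty$, an $n$, and $f_1, \dots, f_N \in \mathcal{F}$ such that for each $f \in \mathcal{F}$ there is an $i \le N$ with $\rho_{P_m(\omega)}(f, f_i) < \tau$ for all $m \ge n$. Let $i(f, \tau)$ be the least such $i$ and $\pi_\tau f := f_{i(f, \tau)}$. Let $G_{P_n(\omega), \tau}(f) := G_{P_n(\omega)}(\pi_\tau f)$. Then, taking expectations with respect to the distributions on the left in (10.55), still for fixed $\omega$, writing $P_{n,\omega} := P_n(\omega)$, we have

\[
\big|EH(G_{P_{n,\omega}}) - EH(G_{P_{m,\omega}})\big| \le H_1 + H_2 + H_3 \tag{10.56}
\]

where $H_1 := |EH(G_{P_{n,\omega}}) - EH(G_{P_{n,\omega}, \tau})|$, $H_2 := |EH(G_{P_{n,\omega}, \tau}) - EH(G_{P_{m,\omega}, \tau})|$, and $H_3 := |EH(G_{P_{m,\omega}, \tau}) - EH(G_{P_{m,\omega}})|$. Then

\[
H_1 \le E\|G_{P_{n,\omega}}\|_{\mathcal{F}'(\tau, \rho_{P_{n,\omega}})}, \qquad H_3 \le E\|G_{P_{m,\omega}}\|_{\mathcal{F}'(\tau, \rho_{P_{m,\omega}})},
\]

and so for $m \ge n$

\[
\max(H_1, H_3) \le \sup_{Q \in \mathcal{P}_f(X)} E\|G_Q\|_{\mathcal{F}'(\tau, \rho_Q)}. \tag{10.57}
\]
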
OEP; (X) | l- (t, pa) 
For a probability measure $Q = P$ or some $P_{n,\omega}$ and $i, j = 1, \dots, N$, let $C_Q$ be the covariance matrix with respect to $Q$, with elements $C_{Qij} = \mathrm{Cov}_Q(f_i, f_j)$, and $C_{nij} := C_{P_{n,\omega} ij}$. By (10.54), for the (almost all) $\omega$ satisfying it, $C_{nij} \to C_{Pij}$ elementwise as $n \to \infty$. Since the dimension $N$ is fixed, the convergence also holds with respect to any norm on the $N(N + 1)/2$-dimensional vector space $S_N$ of $N \times N$ symmetric matrices. Let $H_N$ be the set of nonnegative definite elements of $S_N$, with the topology of any norm on $S_N$. Then $H_N$ is a locally compact metrizable and so Hausdorff space. Any continuous one-to-one mapping of one compact Hausdorff space onto another has a continuous inverse (RAP, Theorem 2.2.11). This holds in suitable locally compact Hausdorff spaces, for example, for the map $A \mapsto A^2$ of $H_N$ onto itself. So, the inverse map $C \mapsto \sqrt{C}$ is continuous $H_N \to H_N$ with respect to any norm(s) on $S_N$. Thus $\sqrt{C_n} \to \sqrt{C_P}$ as $n \to \infty$, and

\[
\lim_{n \to \infty}\ \sup_{m \ge n} \|\sqrt{C_m} - \sqrt{C_n}\|_2 = 0
\]

for the Hilbert-Schmidt norm $\|\cdot\|_2$ used in Lemma 10.25 (or any norm $\|\cdot\|$ on $S_N$). It follows from that Lemma that the $G_{P_{n,\omega}}$ on $\{f_i\}_{i=1}^{N}$ form a Cauchy sequence with respect to the bounded Lipschitz distance $\beta$ on $\mathbb{R}^N$. Thus

\[
\lim_{n \to \infty}\ \sup_{m \ge n}\ \sup_{H \in BL_1^{\mathcal{F}}} H_2 = 0. \tag{10.58}
\]
Since $G_P(f) = W_P(f) - Pf\,W_P(1)$ for all $f \in L^2(P)$, and by (10.47), each $G_{P_n(\omega)}$ takes values in the finite-dimensional subspace of $\ell^\infty(\mathcal{F})$ spanned by $\delta_{X_j(\omega)}$, $j = 1, \dots, n$, and constant functions. Thus, these processes for all $n$, for the given $\omega$, take values in the separable subspace $S_\omega$ of $\ell^\infty(\mathcal{F})$ of functions of the form $c_0 + \sum_{j \ge 1} c_j \delta_{X_j(\omega)}$ such that $\sum_j |c_j| < \infty$. Since $\mathcal{F}$ is $P$-pregaussian by Claim 3, there is a separable subspace $T_P$ of $\ell^\infty(\mathcal{F})$ in which $G_P$ can be defined to take values. There is a separable subspace $T_\omega$ of $\ell^\infty(\mathcal{F})$ including both $S_\omega$ and $T_P$. Each $G_{P_n(\omega)}$ has a law, as mentioned in (10.55), which can be taken to be defined on $S_\omega$ and thus on $T_\omega$. $G_P$ and $G_{P_n(\omega)}$ are measurable random variables with values in $T_\omega$. It follows from (10.58), (10.57), and $\mathcal{F} \in UPG_f$ that the laws $\mathcal{L}(G_{P_n(\omega)})$ on $T_\omega$ form a Cauchy sequence for $\beta_{T_\omega}$, a metric for laws on $T_\omega$. $\mathcal{F} \in UPG_f$ implies that

\[
\sup_{Q \in \mathcal{P}_f(X)} E\|G_Q\|_{\mathcal{F}}^2 < \infty, \tag{10.59}
\]


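as follows from a Gaussian concentration inequality, Theorem 2.47. Namely, for a Gaussian process $V$ with mean 0, defined on a parameter space $T$ separable for $d_V$ as all $G_Q$ processes are for $Q \in \mathcal{P}(X)$, taking $V(\cdot) \in L^2(P)$, we have in distribution $V = L \circ V$, so we can write $\mathrm{ess.sup}_{t \in T} |V_t| =_d Y := |L(A)|^*$ where $A := \{V(t) \in L^2(P) : t \in T\}$. Here $A$ is a GB-set with $EY < \infty$, and Theorem 2.47 and a calculation give $E(Y^2) \le (EY)^2 + K$ where $K$ is an absolute constant depending on the absolute constant $C$ given in Proposition 2.46. Thus the $\|G_{P_n(\omega)}\|_{\mathcal{F}}$ are uniformly integrable and $E\|G_{P_n(\omega)}\|_{\mathcal{F}} \to E\|G_P\|_{\mathcal{F}}$ as $n \to \infty$. Likewise, $E\|G_{P_n(\omega)}\|_{\mathcal{F}'(\delta, \rho_P)} \to E\|G_P\|_{\mathcal{F}'(\delta, \rho_P)}$. For $n$ large enough, by (10.54) as before,

\[
\|G_{P_n(\omega)}\|_{\mathcal{F}'(\delta, \rho_P)} \le \|G_{P_n(\omega)}\|_{\mathcal{F}'(2\delta, \rho_{P_n(\omega)})}.
\]

Since $\mathcal{F} \in UPG_f$, we have

\[
\lim_{\delta \downarrow 0}\ \sup_{P \in \mathcal{P}(X)} E\|G_P\|_{\mathcal{F}'(\delta, \rho_P)} = 0.
\]

So $\mathcal{F} \in UPG$ and Claim 4 is proved.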
Claim 5. $\mathcal{F}$ is uniform Donsker, i.e., (10.35) holds.

Proof of Claim 5. Claim 4 implies

\[
\sup_{P \in \mathcal{P}(X)} E\|G_P\|_{\mathcal{F}} < \infty, \qquad \lim_{\delta \downarrow 0}\ \sup_{P \in \mathcal{P}(X)} E\|G_P\|_{\mathcal{F}'(\delta, \rho_P)} = 0. \tag{10.60}
\]

If we wanted to prove $\beta_{\mathcal{F}}(\nu_n^P, G_P) \to 0$ for an individual $P$, we could prove total boundedness of $\mathcal{F}$ for $\rho_P$ and an asymptotic equicontinuity condition, infer the $P$-Donsker property of $\mathcal{F}$ from Theorem 3.34, and then get the $\beta_{\mathcal{F}}$ convergence from Theorem 3.28. For uniform Donsker, (10.53) gives that $\mathcal{F}$ is $\rho_P$- and, in fact, $e_P$-totally bounded uniformly in $P \in \mathcal{P}(X)$. Then Claim 3, (10.49), gives an asymptotic equicontinuity condition uniformly in $P \in \mathcal{P}(X)$. The conclusion $\beta_{\mathcal{F}}(\nu_n^P, G_P) \to 0$, uniformly in $P \in \mathcal{P}(X)$, which is the definition of uniform Donsker, will be proved directly.

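Let $P \in \mathcal{P}(X)$. Given $\tau > 0$ let $f_1, \dots, f_{N_P(\tau)}$ be a maximal set of members of $\mathcal{F}$ with $e_P(f_i, f_j) > \tau$ for $1 \le i < j \le N_P(\tau)$. For each $f \in \mathcal{F}$ let $\pi_\tau^P(f) = f_i$ for the least $i$ such that $e_P(f, f_i) \le \tau$. For each $f \in \mathcal{F}$ and $j = 1, 2, \dots$, let $Y_j^P(f) := f(X_j) - Pf$ and $Y_{j,\tau}^P(f) := Y_j^P(\pi_\tau^P f)$. By Theorem 3.2 we can and do take $G_P$ such that for (almost) all $\omega$, $G_P$ is uniformly continuous on $\mathcal{F}$ for $\rho_P$ and thus for $e_P$. In particular $G_P$ takes values in a separable subspace of $\ell^\infty(\mathcal{F})$. Let $G_{P,\tau}(f) := G_P(\pi_\tau^P f)$ for $f \in \mathcal{F}$. Each $Y_{j,\tau}^P$, and $G_{P,\tau}$, also takes values in a separable subspace. Let $H \in BL_1^{\mathcal{F}}$. Then

\[
\big|E^* H(\nu_n^P) - EH(G_P)\big| \le \eta_1 + \eta_2 + \eta_3 \tag{10.61}
\]

where

\[
\eta_1 := \Big|E^* H\Big(\frac{1}{\sqrt{n}} \sum_{j=1}^{n} Y_j^P\Big) - EH\Big(\frac{1}{\sqrt{n}} \sum_{j=1}^{n} Y_{j,\tau}^P\Big)\Big|,
\]

\[
\eta_2 := \Big|EH\Big(\frac{1}{\sqrt{n}} \sum_{j=1}^{n} Y_{j,\tau}^P\Big) - EH(G_{P,\tau})\Big|, \qquad \eta_3 := \big|EH(G_{P,\tau}) - EH(G_P)\big|.
\]

By (10.53), $\sup_{P \in \mathcal{P}(X)} N_P(\tau) < +\infty$. Then by Lemma 10.24 we have for each $\tau > 0$

\[
\lim_{n \to \infty}\ \sup\{\eta_2 : P \in \mathcal{P}(X),\ H \in BL_1^{\mathcal{F}}\} = 0. \tag{10.62}
\]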
n> co 


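For each $\varepsilon > 0$ and $H \in BL_1^{\mathcal{F}}$, since for each $\varphi, \psi \in \ell^\infty(\mathcal{F})$, $|H(\varphi) - H(\psi)| \le \min(2, \|\varphi - \psi\|_{\sup})$, we have for each $P \in \mathcal{P}(X)$

\[
\eta_1 \le \varepsilon + 2\Pr\big(\|\nu_n^P\|_{\mathcal{F}'(\tau, e_P)} > \varepsilon\big),
\]

and so by Claim 3 (10.49)

\[
\lim_{\tau \downarrow 0}\ \limsup_{n \to \infty}\ \sup\{\eta_1 : P \in \mathcal{P}(X),\ H \in BL_1^{\mathcal{F}}\} = 0. \tag{10.63}
\]

Clearly, $\eta_3 \le \varepsilon + 2\Pr\big(\|G_P\|_{\mathcal{F}'(\tau, e_P)} > \varepsilon\big)$, and so by (10.60)

\[
\lim_{\tau \downarrow 0}\ \limsup_{n \to \infty}\ \sup\{\eta_3 : P \in \mathcal{P}(X),\ H \in BL_1^{\mathcal{F}}\} = 0. \tag{10.64}
\]

The displays (10.61) through (10.64) combine to prove Claim 5, i.e., that $\mathcal{F}$ is uniform Donsker (10.35), and so prove Theorem 10.22.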
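The proof of Claim 5 rests on a maximal $\tau$-separated set and the map sending each $f$ to the first chosen element within distance $\tau$. For a finite collection this can be built greedily. The sketch below is an illustration under assumptions not in the text: points on the real line with the usual distance stand in for the class with the pseudometric $e_P$, and the function names are hypothetical.

```python
def maximal_separated(points, tau, dist):
    """Greedily pick centers that are pairwise more than tau apart.

    For a finite list the result is maximal: every point is within tau
    of some chosen center, since otherwise it would have been added.
    """
    centers = []
    for p in points:
        if all(dist(p, c) > tau for c in centers):
            centers.append(p)
    return centers

def project(p, centers, tau, dist):
    """Return the center of least index within tau of p (the role of pi_tau)."""
    for c in centers:
        if dist(p, c) <= tau:
            return c
    raise ValueError("centers do not form a maximal tau-separated set")

def distance(a, b):
    return abs(a - b)

points = [0.0, 0.1, 0.25, 0.5, 0.55, 0.9]
centers = maximal_separated(points, 0.2, distance)
images = [project(p, centers, 0.2, distance) for p in points]
print(centers, images)
```

The number of centers plays the role of $N_P(\tau)$, which (10.53) bounds uniformly in $P$.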


Under Pollard’s entropy condition (10.18), Theorem 10.13 can be streng- 
thened as follows: 


Theorem 10.26 If $(X, \mathcal{A})$ is a measurable space and $\mathcal{F}$ is a uniformly bounded, image admissible Suslin class of functions on $(X, \mathcal{A})$ satisfying Pollard's entropy condition (10.18), then $\mathcal{F}$ is a uniform Donsker class.

Proof. Pollard's entropy condition in this case can be written as

\[
\int_0^1 \sup_{Q \in \mathcal{P}_f(X)} \big(\log D(\varepsilon, \mathcal{F}, e_Q)\big)^{1/2}\,d\varepsilon < \infty. \tag{10.65}
\]

For $\delta > 0$ let $G(\delta) := \int_0^\delta \sup_{Q \in \mathcal{P}_f(X)} (\log D(\varepsilon, \mathcal{F}, e_Q))^{1/2}\,d\varepsilon$. Thus $G(\delta) \downarrow 0$ as $\delta \downarrow 0$. The metric entropy sufficient condition for sample continuity of Gaussian processes, in the modulus of continuity form (Theorem 2.36), more specifically with expectations (2.25), together with Theorem 10.22, gives that $\mathcal{F}$ is uniform Donsker.


Theorem 10.15 showed that for a universal Donsker class $\mathcal{F}$, the symmetric convex hull $H(\mathcal{F}, 1)$ and its closure $H_s(\mathcal{F}, 1)$ for sequential pointwise convergence are also universal Donsker classes. Bousquet, Koltchinskii, and Panchenko (2002) proved (or more precisely proved a fact from which it easily follows, as they noted) that the same holds for the uniform Donsker property. They give in detail the sufficient fact, interesting in its own right, in the next theorem.

For a probability space $(\Omega, P)$, a pregaussian set $\mathcal{F} \subset L^2(P)$, and $\delta > 0$ define the modulus of continuity (in expectation) of $G_P$ on $\mathcal{F}$ as

\[
\omega(\mathcal{F}, \delta) := E\|G_P\|_{\mathcal{F}'(\delta, \rho_P)}.
\]

As $\mathcal{F}$ is pregaussian, by Theorem 3.2, we can take $G_P$ to be coherent, i.e., to have prelinear and $\rho_P$-uniformly continuous sample functions on $\mathcal{F}$. Moreover by Lemma 2.30, prelinearity on $\mathcal{F}$ implies that each $G_P(\cdot)(\omega)$ has a unique linear extension to the linear span of $\mathcal{F}$ and thus is uniquely defined on the symmetric convex hull of $\mathcal{F}$. On moduli of continuity we then have:


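Theorem 10.27 (Bousquet, Koltchinskii, and Panchenko) For any pregaussian $\mathcal{F}$ and any $\delta > 0$,

\[
\omega(H(\mathcal{F}, 1), \delta) \le \inf_{\varepsilon > 0} \Big(4\,\omega(\mathcal{F}, \varepsilon) + \delta \sqrt{D(\varepsilon, \mathcal{F}, \rho_P)}\Big). \tag{10.66}
\]

Proof. Let $\varepsilon > 0$ and let $S \subset \mathcal{F}$ be a maximal set such that $\rho_P(f, g) > \varepsilon$ for $f \ne g$ in $S$, of cardinality $N := \mathrm{card}(S) = D(\varepsilon, \mathcal{F}, \rho_P)$. Recall that this must be finite since by (Sudakov's) Theorem 2.19 $E_{\mathcal{F}} := E \sup_{f \in \mathcal{F}} G_P(f) < +\infty$ and then from the Sudakov minoration Theorem 2.22(b),

\[
E_{\mathcal{F}} \ge \frac{\varepsilon}{17} \sqrt{\log D(\varepsilon, \mathcal{F}, \rho_P)}. \tag{10.67}
\]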


Consider the Hilbert space $H = L_0^2(P) = \{f \in L^2(P) : E_P f = 0\}$, with the (covariance) inner product. On this $H$, $G_P$ is the isonormal process, while on the other hand, for any $f \in L^2(P)$ and constant $c$, e.g. $Pf$, $G_P(f - c) = G_P(f)$. Let $V$ be the linear span of $S$, in the Hilbert space $H$. Let $W$ be the orthogonal complement of $V$ in $H$, and let $\pi_V$ and $\pi_W$ be the orthogonal projections onto $V$ and $W$ respectively. Then for each $f \in \mathcal{F}$, $f = \pi_V f + \pi_W f$, and so

\[
\omega(H(\mathcal{F}, 1), \delta) \le E\big[\sup\{|G_P(\pi_V h)| : h \in H(\mathcal{F}, 1)'(\delta, \rho_P)\}\big] + E\big[\sup\{|G_P(\pi_W h)| : h \in H(\mathcal{F}, 1)'(\delta, \rho_P)\}\big].
\]

Since any orthogonal projection $\Pi$ is a contraction, $\|\Pi f - \Pi g\| \le \|f - g\|$, we have

\[
\omega(H(\mathcal{F}, 1), \delta) \le \omega(\pi_V H(\mathcal{F}, 1), \delta) + \omega(\pi_W H(\mathcal{F}, 1), \delta). \tag{10.68}
\]


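To bound the first term on the right of (10.68), let $d$ be the dimension of $V$, for which $d \le D(\varepsilon, \mathcal{F}, \rho_P)$. We have

\[
\omega(\pi_V(H(\mathcal{F}, 1)), \delta) \le \omega(V, \delta),
\]

and since $G_P$ is linear and $V$ is a vector space,

\[
\omega(V, \delta) \le \delta\,E\|Z\| \le \delta\,(E\|Z\|^2)^{1/2}
\]

where $Z$ is a standard normal $d$-dimensional vector, so that

\[
\omega(V, \delta) \le \delta\sqrt{d} \le \delta\sqrt{D(\varepsilon, \mathcal{F}, \rho_P)}. \tag{10.69}
\]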
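The step $E\|Z\| \le (E\|Z\|^2)^{1/2} = \sqrt{d}$ can be compared with the exact mean of the chi distribution, $E\|Z\| = \sqrt{2}\,\Gamma((d+1)/2)/\Gamma(d/2)$, a standard formula that is not quoted in the text. A small sketch of that comparison (the function name is hypothetical):

```python
from math import exp, lgamma, pi, sqrt

def mean_norm_standard_normal(d):
    """E||Z|| for Z ~ N(0, I_d): sqrt(2) * Gamma((d+1)/2) / Gamma(d/2).

    Computed through log-gamma so that large d does not overflow.
    """
    return sqrt(2.0) * exp(lgamma((d + 1) / 2) - lgamma(d / 2))

ratios = {d: mean_norm_standard_normal(d) / sqrt(d) for d in (1, 2, 5, 50, 500)}
print(ratios)  # each ratio is at most 1 and approaches 1 as d grows
```

So the bound $\delta\sqrt{d}$ loses little for large $d$; the loss in (10.69) comes mainly from replacing $d$ by the packing number.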


For the second term on the right of (10.68), we have a crude bound, which will be sufficient,

\[
\omega(\pi_W H(\mathcal{F}, 1), \delta) \le 2E\big[\sup\{|G_P(\pi_W f)| : f \in H(\mathcal{F}, 1)\}\big].
\]

Since $\pi_W$ and $G_P$ are linear, the supremum is attained at extreme points of $H(\mathcal{F}, 1)$, which are functions $\pm f$ for $f \in \mathcal{F}$, from which we get

\[
\omega(\pi_W H(\mathcal{F}, 1), \delta) \le 2E\big[\sup\{|G_P(\pi_W f)| : f \in \mathcal{F}\}\big].
\]

For $S = \{f_1, \dots, f_N\}$, given $f \in \mathcal{F}$, let $g = f_j$ be the closest point of $S$ to $f$, or the one with smallest $j$ in case of a tie. Then $\|f - g\| \le \varepsilon$ and $g \in V \cap \mathcal{F}$, so $\pi_W g = 0$ and

\[
\begin{aligned}
\omega(\pi_W H(\mathcal{F}, 1), \delta) &\le 2E\big[\sup\{|G_P(\pi_W f) - G_P(\pi_W g)| : f, g \in \mathcal{F},\ \|f - g\| \le \varepsilon\}\big] \\
&= 2E\|G_P \circ \pi_W\|_{\mathcal{F}'(\varepsilon, \rho_P)}.
\end{aligned}
\]

Since $\pi_W$ is a contraction, the last expression can be bounded above using an inequality related to Slepian's, (2.21) in Theorem 2.18, by

\[
4E\|G_P\|_{\mathcal{F}'(\varepsilon, \rho_P)} = 4\,\omega(\mathcal{F}, \varepsilon).
\]

Combining this with (10.69) gives (10.66).

Now the following will be proved:


Theorem 10.28 Let $(X, \mathcal{S})$ be a measurable space and $\mathcal{F}$ a uniform Donsker class of functions on $X$ which is uniformly bounded and image-admissible Suslin. Then the symmetric convex hull $H(\mathcal{F}, 1)$ of $\mathcal{F}$ and its set $H_s(\mathcal{F}, 1)$ of limits under sequential pointwise convergence are also uniform Donsker classes.


Proof. It suffices by Theorem 10.22 to show that if $\mathcal{F}$ is uniformly pregaussian, then so are the other two classes of functions. By (10.31) we have $M := \sup_{P \in \mathcal{P}(X)} E\|G_P\|_{\mathcal{F}} < \infty$ where in (10.67) $E_{\mathcal{F}} \le M$ for all $P$. Moreover, $\|G_P\|_{H(\mathcal{F}, 1)} = \|G_P\|_{\mathcal{F}}$, so that (10.31) holds for $H(\mathcal{F}, 1)$ with the same $M$. For each $\varepsilon > 0$, and all $P \in \mathcal{P}(X)$, by (10.67)

\[
N := D(\varepsilon, \mathcal{F}, \rho_P) \le N_0(\varepsilon) := \exp(289 M^2/\varepsilon^2).
\]

To prove (10.32) for $H(\mathcal{F}, 1)$, as $\delta \downarrow 0$, it will suffice to choose $\varepsilon = \varepsilon(\delta)$ in (10.66) which also converges to 0 and is such that $\delta\sqrt{N_0(\varepsilon)}$ converges to 0, so that the right side of (10.66) will converge to 0. For the latter it will suffice that $N_0(\varepsilon) \le 1/\sqrt{\delta}$. We can set $\varepsilon = 17\sqrt{2}\,M/\sqrt{\log(1/\delta)}$, which does converge to 0 as $\delta \downarrow 0$ (although, of course, more slowly than $\delta$ does). Sequential pointwise limits do not change any $\|G_P\|_{\mathcal{G}}$ or $\|G_P\|_{\mathcal{G}'(\delta, \rho_P)}$, from $\mathcal{G} = H(\mathcal{F}, 1)$ to $\mathcal{G} = H_s(\mathcal{F}, 1)$, so $H_s(\mathcal{F}, 1)$ is also uniformly pregaussian and thus uniformly Donsker.
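The choice of $\varepsilon(\delta)$ in this proof can be verified directly: with $\varepsilon = 17\sqrt{2}\,M/\sqrt{\log(1/\delta)}$ one gets $N_0(\varepsilon) = \delta^{-1/2}$, so $\delta\sqrt{N_0(\varepsilon)} = \delta^{3/4}$. The following sketch checks this arithmetic for a few values; the value of $M$ and the function names are hypothetical.

```python
from math import exp, isclose, log, sqrt

def n0(eps, M):
    """N_0(eps) = exp(289 M^2 / eps^2), the uniform bound on packing numbers."""
    return exp(289.0 * M * M / (eps * eps))

def eps_of_delta(delta, M):
    """The choice eps = 17 sqrt(2) M / sqrt(log(1/delta)), for 0 < delta < 1."""
    return 17.0 * sqrt(2.0) * M / sqrt(log(1.0 / delta))

M = 2.5  # hypothetical value of sup_P E||G_P||_F
rows = []
for delta in (1e-1, 1e-2, 1e-4, 1e-8):
    eps = eps_of_delta(delta, M)
    rows.append((delta, eps, n0(eps, M), delta * sqrt(n0(eps, M))))

for delta, eps, packing_bound, second_term in rows:
    print(delta, eps, packing_bound, second_term)
```

The second term $\delta^{3/4}$ tends to 0 quickly, while $\varepsilon(\delta)$ tends to 0 only at a logarithmic rate, which is the slow convergence the proof mentions.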


10.5 Universal Glivenko—Cantelli Classes 


Given a measurable space $(X, \mathcal{A})$, a class $\mathcal{F}$ of measurable real-valued functions on $X$ is called a universal Glivenko–Cantelli class if it is Glivenko–Cantelli for each law $P$ on $(X, \mathcal{A})$. The notion of universal Glivenko–Cantelli class seems to be excessively general. For example, if $X$ is a countably infinite set, then the class $2^X$ of all its subsets is Glivenko–Cantelli for every $P$ defined on it, but it is not uniformly Glivenko–Cantelli (Problem 5); see also Problem 6. Dudley, Giné and Zinn (1991) give various examples of unexpectedly large universal Glivenko–Cantelli classes. Some of them could well be called pathological. It seems that uniform Glivenko–Cantelli classes are of considerable interest in statistics and machine learning (e.g., Alon et al. 1997), where $P$ is unknown, but universal Glivenko–Cantelli classes may not be.


Problems 


1. Let $V$ be a finite-dimensional vector space of real-valued functions on a set $X$. Let $W$ be a subset of $V$ consisting of functions $f$ with $f(x) > 0$ for all $x \in X$. Show that the set of all quotients $f/g$ for $f \in V$ and $g \in W$ is a VC major class. Hint: For any real $t$, $f/g > t$ if and only if $f - gt > 0$. Apply a fact from Chapter 4.


2. For $d = 1, 2, \dots$ and $k = 1, 2, \dots$, let $V_{k,d}$ be the vector space of all polynomials on $\mathbb{R}^d$ of degree at most $2k$. For $M$ with $0 < M < +\infty$, let $V_{k,d,M}$ be the set of all $f \in V_{k,d}$, all of whose coefficients have absolute values $\le M$. Let $\mathcal{F}$ be the set of all quotients $x \mapsto f(x)/(1 + |x|^{2k})$ for $f \in V_{k,d,M}$. Show that $\mathcal{F}$ is a universal Donsker class. Hints: Apply Problem 1 and Corollary 10.21. One needs to show that $\mathcal{F}$ is uniformly bounded. A bound on the dimension of $V_{k,d}$ will help. To prove that $\mathcal{F}$ is image admissible Suslin, use that the set of possible vectors of coefficients is a Polish space and show joint measurability.


3. Show that each class $\mathcal{G}_{\alpha,K,d}$ in Theorem 8.4 for $\alpha > 0$ and $K < \infty$ is a uniform Glivenko–Cantelli class. Hint: The sets $\mathcal{G}_{\alpha,K,d}$ are uniformly bounded and totally bounded for $d_{\sup}$, by the Arzelà–Ascoli theorem or more specifically by Theorem 8.4.


4. Show that the “ellipsoid” universal Donsker class of Proposition 10.17 is not 
uniform Donsker. 


5. For a countably infinite set, say the set $\mathbb{N}$ of nonnegative integers, show that the class $\mathcal{C} = 2^{\mathbb{N}}$ of all subsets is a universal Glivenko–Cantelli class as shown in Problem 8 of Chapter 6. Show that it is not a uniform Glivenko–Cantelli class.


6. Let $(X, \mathcal{A})$ be a measurable space and let $X$ be the union of a sequence of disjoint measurable sets $A_k$. Suppose that for each $k$, $\mathcal{C}_k$ is an image-admissible Suslin VC class of sets, where $S(\mathcal{C}_k)$ may go to $\infty$ arbitrarily fast as $k \to \infty$. Show that the collection $\mathcal{C}$ of all sets $A$ such that $A \cap A_k \in \mathcal{C}_k$ for each $k$ is a universal Glivenko–Cantelli class. Hint: Use the result of the previous problem and the Glivenko–Cantelli property of each VC class $\mathcal{C}_k$.


7. If $\mathcal{F}$ is a universal Glivenko–Cantelli class of functions on $(X, \mathcal{A})$, then $\mathcal{F}$ is uniformly bounded up to additive constants, i.e., the conclusion of Proposition 10.10 holds for it. Hint: For each $f \in \mathcal{F}$ and law $P$ on $(X, \mathcal{A})$, $P(|f|)$ must be finite in order for $(P_n - P)(f)$ to make sense. Adapt the first part of the proof of Proposition 10.10, using $8^k$ in place of $2^k$.


Notes 


Notes to Section 10.1. The main Theorem 10.6 was proved by Dudley, Giné, and Zinn (1991, Theorem 6). I am very indebted to Evarist Giné for suggestions on the proof given here. Theorem 10.8 is essentially Lemma 2.9 of Giné and Zinn (1984), for which they give credit to G. Pisier and X. Fernique. Later, Alon, Ben-David, Cesa-Bianchi, and Haussler (1997) gave another, interesting characterization of the uniform Glivenko–Cantelli property.


Notes to Section 10.2. This section is based on Dudley (1987). 


Notes to Section 10.3. Theorem 10.19 is from Dudley (1987). The argument 
of B. Maurey used in its proof was published in the proofs of Pisier (1981), 
Lemma 2, and Carl (1982), Lemma 1. The example showing that 2A /(2 + A) 
is sharp is from Dudley (1967a), Propositions 5.8 and 6.12. van der Vaart and 
Wellner (1996, Theorem 2.6.9) and Carl (1997) showed that one can take t = s 
in Theorem 10.19. 


Notes to Section 10.4. The section is based on the paper of Giné and Zinn 
(1991). Lemma 10.24 is their Lemma 2.1, which they say is well known; the 
bound (10.36) is used in the 1991 proof and given with proof, for d = 1 but the 
proof directly extends to all d, in Araujo and Giné (1980, Theorem 1.3). 

In Lemma 10.25 and its proof, I do not claim that among random variables $X$, $Y$ with distributions $N(0, C)$ and $N(0, D)$, $X = \sqrt{C}Z$ and $Y = \sqrt{D}Z$ with $Z$ having $N(0, I_d)$ achieve the minimum of $E|X - Y|^2$, even requiring $(X, Y)$ to have a Gaussian joint distribution. According to Olkin and Pukelsheim (1982), they do if $C$ and $D$ commute, but not in general.

It appears that Dobrushin (1970) proposed the name "Wasserstein metric," referring to a paper by Vaserštein (1969). Cuesta and Matran (1989) say such metrics are due to Kantorovich (1942). The Kantorovich-Rubinštein theorem about this metric was published in Kantorovich and Rubinštein (1958). Some authors later have used the name "Kantorovich metric." Kantorovich's work on "transportation problems" has become quite widely known.

The main theorem characterizing uniform Donsker classes, Theorem 10.22, 
is Theorem 2.3 of Giné and Zinn (1991). The fact that Pollard’s entropy condi- 
tion, with boundedness and measurability, implies uniform Donsker, Theorem 
10.26, follows from Proposition 3.1 of Giné and Zinn (1991) as stated there, 
although a different method of providing a sample modulus in expectation was 
used here. 

Bousquet, Koltchinskii, and Panchenko (2002) essentially proved Theorem 
10.28 on preservation of the uniform Donsker property by convex hulls, in that 
they did prove Theorem 10.27 from which it easily follows. (They gave a factor 
of 2 rather than 4 in (10.66); the 4 is used here to allow a more self-contained 
proof based on (2.21).)


Classes of Sets or Functions Too Large for
Central Limit Theorems 


11.1 Universal Lower Bounds 


This chapter is primarily about asymptotic lower bounds for $\|P_n - P\|_{\mathcal{F}}$ on certain classes $\mathcal{F}$ of functions, as treated in Chapter 8, mainly classes of indicators of sets. Section 11.2 will give some upper bounds which indicate the sharpness of some of the lower bounds. Section 11.4 gives some relatively difficult lower bounds on classes such as the convex sets in $\mathbb{R}^3$ and lower layers in $\mathbb{R}^2$. In preparation for this, Section 11.3 treats Poissonization and random "stopping sets" analogous to stopping times. The present section gives lower bounds in some cases which hold not only with probability converging to 1, but for all possible $P_n$. Definitions are as in Sections 3.1 and 8.2, with $P := U(I^d) = \lambda^d$ = Lebesgue measure on $I^d$. Specifically, recall the classes $\mathcal{G}(\alpha, K, d) := \mathcal{G}_{\alpha,K,d}$ of functions on the unit cube $I^d \subset \mathbb{R}^d$ with derivatives through $\alpha$th order bounded by $K$, and the related families $\mathcal{C}(\alpha, K, d)$ of sets (subgraphs of functions in $\mathcal{G}(\alpha, K, d - 1)$), both defined early in Section 8.2.


Theorem 11.1 (Bakhvalov) For $P = U(I^d)$, any $d = 1, 2, \dots$ and $\alpha > 0$, there is a $\gamma = \gamma(d, \alpha) > 0$ such that for all $n = 1, 2, \dots$, and all possible values of $P_n$, we have $\|P_n - P\|_{\mathcal{G}(\alpha,1,d)} \ge \gamma n^{-\alpha/d}$.

Remarks. When $\alpha < d/2$, this shows that $\mathcal{G}(\alpha, K, d)$, $K > 0$, is not a Donsker class. For $\alpha > d/2$, $\mathcal{G}(\alpha, K, d)$ is Donsker by Corollary 8.5(b). The lower bound in Theorem 11.1 is then not useful, since it is smaller than the average size of $\|P_n - P\|_{\mathcal{G}(\alpha,1,d)}$, which is at least of order $n^{-1/2}$; even for one function $f$ not constant a.e. $P$, $E|(P_n - P)(f)| \ge cn^{-1/2}$ for some $c > 0$. For $\alpha = d/2$ and $\mathcal{F} := \mathcal{G}(d/2, 1, d)$, recalling $\nu_n := \sqrt{n}(P_n - P)$, always $\|\nu_n\|_{\mathcal{F}} \ge \gamma > 0$, so $\mathcal{F}$ is not $P$-Donsker, as follows. Theorem 2.32(b) says that for a pregaussian class $\mathcal{F}$ (a GC-set), $\|G_P\|_{\mathcal{F}} < \gamma$ with positive probability, so $\nu_n$ cannot converge to it in law (see Problem 3).
Theorem 11.1 gives information about accuracy of possible methods of numerical integration in several dimensions, or "cubature," using the values of a function $f \in \mathcal{G}(\alpha, K, d)$ at just $n$ points chosen in advance (from the proof, it will be seen that one has the same lower bound even if one can use any partial derivatives of $f$ at the $n$ points). It was in this connection that Bakhvalov (1959) proved the theorem.


Proof. Given $n$ let $m := m(n) := \lceil (2n)^{1/d} \rceil$ where $\lceil x \rceil$ is the smallest integer $\ge x$. Decompose the unit cube $I^d$ into $m^d$ cubes $C_i$ of side $1/m$. Then $m^d \ge 2n$. For any $P_n$ let $S := \{i : P_n(C_i) = 0\}$. Then $\mathrm{card}(S) \ge n$. For $x_{(i)}$, $f_i$ and $f_S$ as defined after (8.11), we then have for each $i$,

\[
P(f_i) = m^{-\alpha} \int f(m(x - x_{(i)}))\,dx = m^{-\alpha} \int f(y)\,dy/m^{d} = m^{-\alpha - d} P(f).
\]

Thus $|(P_n - P)(f_S)| = P(f_S) \ge m^{-\alpha - d}\, n P(f) \ge c n^{-\alpha/d}$ for some constant $c = c(d, \alpha) > 0$, while $\|f_S\|_\alpha \le \|f\|_\alpha \le 1$ (dividing the original $f$ by some constant depending on $d$ and $\alpha$ if necessary).
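The counting step in this proof is a pigeonhole argument: $n$ sample points can occupy at most $n$ of the $m^d \ge 2n$ cubes, so at least $n$ cubes are empty. The sketch below illustrates this under assumptions not in the text: the cubes are taken half-open so that each point lies in exactly one of them, and the sample is drawn with a fixed random seed. Function names are hypothetical.

```python
import random

def cube_side_count(n, d):
    """Smallest integer m with m**d >= 2n, i.e. the ceiling of (2n)^(1/d),
    found with integer arithmetic to avoid floating-point rounding."""
    m = 1
    while m ** d < 2 * n:
        m += 1
    return m

def count_empty_cubes(points, m):
    """Number of the m**d grid cubes of side 1/m containing no sample point."""
    d = len(points[0])
    occupied = {tuple(min(int(x * m), m - 1) for x in p) for p in points}
    return m ** d - len(occupied)

rng = random.Random(0)
n, d = 50, 2
m = cube_side_count(n, d)
sample = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
empty = count_empty_cubes(sample, m)
print(m, m ** d, empty)  # at least n of the m**d cubes are empty
```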


Theorem 11.2 For $P = U(I^d)$, any $K > 0$ and $0 < \alpha \le d - 1$ there is a $\delta = \delta(\alpha, K, d) > 0$ such that for all $n = 1, 2, \dots$ and all possible values of $P_n$,

\[
\|P_n - P\|_{\mathcal{C}(\alpha,K,d)} \ge \delta n^{-\alpha/(\alpha + d - 1)}.
\]

Remark. Since $\alpha/(d - 1 + \alpha) < 1/2$ for $\alpha < d - 1$, the classes $\mathcal{C}(\alpha, K, d)$ are then not Donsker classes. For $\alpha > d - 1$, $\mathcal{C}(\alpha, K, d)$ is a Donsker class by Theorem 8.4 and Corollary 7.8. For $\alpha = d - 1$, it is not a Donsker class via Theorem 2.32(b) as in the remarks after the preceding theorem. Theorem 11.10 below will show that $\|\nu_n\|_{\mathcal{C}(d-1,K,d)}$ is unbounded in probability (at a logarithmic rate).


Proof. Again, the construction and notation around (8.11) will be applied, now on $I^{d-1}$. Let $c := K\lambda^{d-1}(f)/\|f\|_\alpha$, $m := m(n) := \lfloor (nc)^{1/(\alpha + d - 1)} \rfloor$. Then $1/n \le cm^{1 - d - \alpha}$. Take $n \ge r$ (the result holds for $n < r$) for an $r$ with

\[
M := \sup_{n \ge r} nc\,m(n)^{1 - d - \alpha} < \infty.
\]

Let $\Theta_n := m(n)^{\alpha + d - 1}/(nc)$. Then $1/M \le \Theta_n \le 1$. Let $g_i := K\Theta_n f_i/(2\|f\|_\alpha)$. Then $\lambda^{d-1}(g_i) = 1/(2n)$ and $\|g_i\|_\alpha \le K$. Let $B_i := \{x \in I^d : 0 \le x_d \le g_i(x_{(d)})\}$ where $x_{(d)} := (x_1, \dots, x_{d-1})$. Then for each $i$, $\lambda^d(B_i) = P(B_i) = 1/(2n)$ and either $P_n(B_i) = 0$ or $P_n(B_i) \ge 1/n$. Either at least half the $B_i$ have $P_n(B_i) = 0$, or at least half have $P_n(B_i) \ge 1/n$. In either case,

\[
\|P_n - P\|_{\mathcal{C}(\alpha,K,d)} \ge m^{d-1}/(4n) \ge cm^{-\alpha}/(4M) \ge \delta n^{-\alpha/(\alpha + d - 1)}
\]

for some $\delta(\alpha, K, d) > 0$.
The method of the last proof applies to convex sets in $\mathbb{R}^d$, as follows.


Theorem 11.3 (W. M. Schmidt) Let $d = 2, 3, \ldots$. For the collection $\mathcal{C}_d$ of
closed convex subsets of a bounded nonempty open set $U$ in $\mathbb{R}^d$ there is a
constant $b := b(d, U) > 0$ such that for $P$ = Lebesgue measure normalized
on $U$, and all $P_n$,

$$\sup\{|(P_n - P)(C)| : C \in \mathcal{C}_d\} \ge b\, n^{-2/(d+1)}.$$


Proof. We can assume by a Euclidean transformation that $U$ includes the unit
ball. Take disjoint spherical caps as at the end of Section 8.4. Let each cap $C$
have volume $P(C) = 1/(2n)$. Then as $n \to \infty$, the angular radius $\varepsilon_n$ of such
caps is asymptotic to $c_d\, n^{-1/(d+1)}$ for some constant $c_d$. Thus the number of such
disjoint caps is of the order of $n^{(d-1)/(d+1)}$. Either $(P_n - P)(C) \ge 1/(2n)$ for
at least half of such caps, or $(P_n - P)(C) = -1/(2n)$ for at least half of them.
Thus for some constant $\eta = \eta(U, d)$, there exist convex sets $D$, $E$, differing by
a union of caps, such that

$$|(P_n - P)(D) - (P_n - P)(E)| \ge \eta\, n^{(d-1)/(d+1) - 1} = \eta\, n^{-2/(d+1)},$$

and the result follows with $b = \eta/2$.


Thus for $d \ge 4$, $\mathcal{C}_d$ is not a Donsker class for $P$. If $d = 3$, it is not either,
by the same argument as in the remarks after the previous two theorems;
cf. Problem 3. $\mathcal{C}_2$ is a Donsker class for $\lambda^2$ on $I^2$ by Theorem 8.25 and
Corollary 7.8.


11.2 An Upper Bound 


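Here, using metric entropy with bracketing $N_I$ as in Section 7.1, is an upper
bound for $\|\nu_n\|_{\mathcal{C}} := \sup_{B \in \mathcal{C}} |\nu_n(B)|$, which applies in many cases where the
hypotheses of Corollary 7.8 fail. Let $(X, \mathcal{A}, Q)$ be a probability space, $\nu_n :=
n^{1/2}(Q_n - Q)$, and recall $N_I$ as defined before (7.4).


Theorem 11.4 Let $\mathcal{C} \subset \mathcal{A}$, $1 \le \zeta < \infty$, $\eta > 2/(\zeta + 1)$ and $\Theta := (\zeta -
1)/(2\zeta + 2)$. If for some $K < \infty$, $N_I(\varepsilon, \mathcal{C}, Q) \le \exp(K\varepsilon^{-\zeta})$, $0 < \varepsilon \le 1$, then

$$\lim_{n \to \infty} \mathrm{Pr}^*\{\|\nu_n\|_{\mathcal{C}} > n^{\Theta} (\log n)^{\eta}\} = 0.$$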
Remarks. The statement is not interesting if $\mathcal{C}$ is a $Q$-Donsker class, as $\Theta \ge 0$,
$\eta > 0$, and then $\mathrm{Pr}^*\{\|\nu_n\|_{\mathcal{C}} > a_n\} \to 0$ for any $a_n \to +\infty$. Nor is the statement
useful in proving classes are not Donsker because for that we need lower, not
upper, bounds. It is useful in showing that some lower bounds are sharp, with
respect to $\Theta$. (With respect to $\eta$, it is not sharp as will be seen in Section 11.4.)
The classes $\mathcal{C} = \mathcal{C}(\alpha, M, d)$ satisfy the hypothesis of Theorem 11.4 for
$\zeta = (d - 1)/\alpha \ge 1$, i.e. $\alpha \le d - 1$, by the last inequality in Theorem 8.4. Then
$\Theta = \frac{1}{2} - \frac{\alpha}{d - 1 + \alpha}$. Thus Theorem 11.2 shows that the exponent $\Theta$ is sharp for
$\zeta > 1$. Conversely, Theorem 11.4 shows that the exponent on $n$ in Theorem 11.2
cannot be improved. In Theorem 11.4 we cannot take $\zeta < 1$, for then $\Theta < 0$,
which is impossible even for a single set, $\mathcal{C} = \{C\}$, with $0 < P(C) < 1$.
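
The identity between the two expressions for $\Theta$ used in these remarks is elementary algebra; a minimal exact-arithmetic check is sketched below. The function names are hypothetical and the grid of $(\alpha, d)$ values is an arbitrary illustration.

```python
from fractions import Fraction

def theta_from_zeta(zeta):
    """Theta as defined in Theorem 11.4: (zeta - 1) / (2*zeta + 2)."""
    return (zeta - 1) / (2 * zeta + 2)

def theta_from_alpha(alpha, d):
    """Theta written in terms of alpha and d: 1/2 - alpha / (d - 1 + alpha)."""
    return Fraction(1, 2) - Fraction(alpha) / (d - 1 + alpha)

# Compare the two forms exactly, with zeta = (d - 1) / alpha and alpha <= d - 1.
for d in range(2, 7):
    for alpha in (Fraction(1, 2), 1, Fraction(3, 2), 2):
        if alpha <= d - 1:
            zeta = Fraction(d - 1) / alpha
            assert theta_from_zeta(zeta) == theta_from_alpha(alpha, d)
```

In particular $n^{1/2} \cdot n^{-\alpha/(\alpha+d-1)} = n^{\Theta}$, which is how the lower bound of Theorem 11.2 for $P_n - P$ matches the upper bound of Theorem 11.4 for $\nu_n$.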


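Proof. The chaining method will be used as in Section 7.2. Let $\log_2$ be logarithm
to the base 2. For each $n \ge 3$ let

$$k(n) := \left\lfloor \left(\tfrac{1}{2} - \Theta\right) \log_2 n - \eta \log_2 \log n \right\rfloor. \tag{11.1}$$

Let $N(k) := N_I(2^{-k}, \mathcal{C}, Q)$, $k = 1, 2, \ldots$. Then for some $A_{ki}$ and $B_{ki} \in
\mathcal{A}$, $i = 1, \ldots, N(k)$, and any $A \in \mathcal{C}$, there is an $i \le N(k)$ with $A_{ki} \subset A \subset B_{ki}$
and $Q(B_{ki} \setminus A_{ki}) \le 2^{-k}$. Let $A_{01} := \emptyset$ (the empty set) and $B_{01} := X$. Choose
such $i = i(k, A)$ for $k = 0, 1, \ldots$. Then for each $k \ge 1$,

$$Q(A_{k,i(k,A)} \,\Delta\, A_{k-1,i(k-1,A)}) \le 2^{2-k}$$

where $\Delta$ := symmetric difference, $C \Delta D := (C \setminus D) \cup (D \setminus C)$.
For $k \ge 1$ let $\mathcal{B}(k)$ be the collection of sets $B$, with $Q(B) \le 2^{2-k}$, of the
form $A_{ki} \setminus A_{k-1,j}$ or $A_{k-1,j} \setminus A_{ki}$ or $B_{ki} \setminus A_{ki}$. Then

$$\mathrm{card}(\mathcal{B}(k)) \le 2N(k-1)N(k) + N(k) \le 3\exp(2K 2^{k\zeta}).$$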


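For each $B \in \mathcal{B}(k)$, Bernstein's inequality (Theorem 1.11) implies, for any
$t > 0$,

$$\Pr\{|\nu_n(B)| > t\} \le 2\exp\left(-t^2/(2^{3-k} + t n^{-1/2})\right). \tag{11.2}$$

Choose $\delta > 0$ such that $\delta < 1$ and $(2 + 2\delta)/(1 + \zeta) < \eta$. Let $c := \delta/(1 + \delta)$
and $t := t_{n,k} := c\, n^{\Theta} (\log n)^{\eta} k^{-1-\delta}$. Then for each $k = 1, \ldots, k(n)$, $2^{3-k} \ge
8 n^{\Theta - 1/2} (\log n)^{\eta} \ge t_{n,k}\, n^{-1/2}$. Hence by (11.2),

$$\Pr\{|\nu_n(B)| > t_{n,k}\} \le 2\exp(-t_{n,k}^2/2^{4-k}),$$

and

$$p_{nk} := \Pr\Big\{\sup_{B \in \mathcal{B}(k)} |\nu_n(B)| > t_{n,k}\Big\} \le 6\exp\left(2K 2^{k\zeta} - 2^{k-4} t_{n,k}^2\right)
= 6\exp\left(2K 2^{k\zeta} - 2^{k-4} c^2 n^{2\Theta} (\log n)^{2\eta} k^{-2-2\delta}\right).$$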
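For $k \le k(n)$ we have $2^k (\log n)^{\eta} \le n^{1/2 - \Theta}$. Since $\Theta < 1/2$ and $2\Theta/(\frac{1}{2} -
\Theta) = \zeta - 1$, we have $n^{2\Theta} \ge 2^{k(\zeta-1)} (\log n)^{\eta(\zeta-1)}$, and

$$p_{nk} \le 6\exp\left(2^{k\zeta}\left(2K - 2^{-4} c^2 (\log n)^{\eta(\zeta+1)} k^{-2-2\delta}\right)\right).$$

Let $\gamma := \eta(\zeta + 1) - 2 - 2\delta > 0$ by choice of $\delta$. Since $\frac{1}{2} < \log 2$, $k(n) \le \log n$
and $p_{nk} \le 6\exp\left(2^{k\zeta}(2K - 2^{-4} c^2 (\log n)^{\gamma})\right)$.
For $n$ large, $(\log n)^{\gamma} > 64K/c^2$. Then $2K < 2^{-5} c^2 (\log n)^{\gamma}$ and $2^{k\zeta} \ge 1$ for
$k \ge 1$, so

$$\sum_{k=1}^{k(n)} p_{nk} \le 6(\log n)\exp\left(-2^{-5} c^2 (\log n)^{\gamma}\right) \to 0 \quad \text{as } n \to \infty.$$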
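Let $\mathcal{E}_n := \{\omega : \sup_{B \in \mathcal{B}(r)} |\nu_n(B)| \le t_{n,r},\ r = 1, \ldots, k(n)\}$. Then $\lim_{n \to \infty} \Pr(\mathcal{E}_n)
= 1$. For any $A \in \mathcal{C}$ and $n$, let $k := k(n)$ and $i := i(k, A)$. Then for each $\omega \in \mathcal{E}_n$,
$|\nu_n(A_{ki})|$ is bounded above by

$$\sum_{r=1}^{k} \left[\, |\nu_n(A_{r,i(r,A)} \setminus A_{r-1,i(r-1,A)})| + |\nu_n(A_{r-1,i(r-1,A)} \setminus A_{r,i(r,A)})| \,\right]
\le 2\sum_{r=1}^{k} t_{n,r} \le 2 c\, n^{\Theta} (\log n)^{\eta} \sum_{r \ge 1} r^{-1-\delta} \le 2 n^{\Theta} (\log n)^{\eta},$$

$|\nu_n(B_{ki} \setminus A_{ki})| \le n^{\Theta} (\log n)^{\eta}$, and by (11.1),

$$n^{1/2} Q(B_{ki} \setminus A_{ki}) \le n^{1/2}/2^k < 2 n^{\Theta} (\log n)^{\eta}.$$

Hence $n^{1/2} Q_n(B_{ki} \setminus A_{ki}) \le 3 n^{\Theta} (\log n)^{\eta}$, and $|\nu_n(A \setminus A_{ki})| \le 3 n^{\Theta} (\log n)^{\eta}$. So on
$\mathcal{E}_n$, $|\nu_n(A)| \le 5 n^{\Theta} (\log n)^{\eta}$. As $\eta \downarrow 2/(\zeta + 1)$ the factor of 5 can be dropped.
Since $A \in \mathcal{C}$ is arbitrary, the proof is done.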


11.3 Poissonization and Random Sets 


Section 11.4 will give some lower bounds $\|\nu_n\|_{\mathcal{C}} \ge f(n)$ with probability
converging to 1 as $n \to \infty$ where $f$ is a product of powers of logarithms or
iterated logarithms. Such an $f$ has the following property. A real-valued
function $f$ defined for large enough $x > 0$ is called slowly varying (in the sense of
Karamata) iff for every $c > 0$, $f(cx)/f(x) \to 1$ as $x \to +\infty$.


Lemma 11.5 If $f$ is continuous and slowly varying, then for every $\varepsilon > 0$
there is a $\delta = \delta(\varepsilon) > 0$ such that whenever $x > 1/\delta$ and $|1 - \frac{y}{x}| < \delta$ we have

$$\left|1 - \frac{f(y)}{f(x)}\right| < \varepsilon.$$
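
As a numerical illustration of the definition (not part of the argument), the sketch below compares $f(x) = (\log x)^{3/4}$, the kind of function arising in Section 11.4, with the power $x^{0.1}$, which is not slowly varying. The helper `ratio` and the sample points are assumptions chosen for illustration.

```python
import math

def f(x):
    """A product of powers of logarithms: slowly varying."""
    return math.log(x) ** 0.75

def power(x):
    """A small positive power of x: not slowly varying."""
    return x ** 0.1

def ratio(g, c, x):
    """The quotient g(c*x) / g(x) from the definition of slow variation."""
    return g(c * x) / g(x)

for x in (1e3, 1e30, 1e300):
    # The first ratio drifts toward 1; the second stays at 2**0.1 for every x.
    print(x, ratio(f, 2.0, x), ratio(power, 2.0, x))
```

The convergence for the logarithmic function is slow, which is why the statements in this section are asymptotic and carry no explicit rate.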
Proof. For each $c > 0$ and $\varepsilon > 0$ there is an $x(c, \varepsilon)$ such that for $x \ge x(c, \varepsilon)$,
$\left|\frac{f(cx)}{f(x)} - 1\right| < \frac{\varepsilon}{4}$. Note that if $x(c, \varepsilon) \le n$, even for one $c$, then $f(x) \ne 0$
for all $x \ge n$. By the category theorem (e.g., RAP, Theorem 2.5.2), for fixed
$\varepsilon > 0$ there is an $n < \infty$ such that $x(c, \varepsilon) \le n$ for all $c$ in a set dense in
some interval $[a, b]$, where $0 < a < b$, and thus by continuity for all $c$ in
$[a, b]$. Then for $c, d \in [a, b]$ and $x \ge n$, $|(f(cx) - f(dx))/f(x)| \le \varepsilon/2$. Fix
$c = (a + b)/2$.
Let $u := cx$. Then for $u \ge nc$, we have

$$\left|\frac{f(ud/c)}{f(u)} - 1\right| \cdot \left|\frac{f(u)}{f(u/c)}\right| \le \frac{\varepsilon}{2}.$$

As $u \to +\infty$, $f(u)/f(u/c) \to 1$. Then there is a $\delta > 0$ with $\delta < (b - a)/(b + a)$
such that for $u \ge 1/\delta$ and all $r$ with $|r - 1| < \delta$ we have $\left|\frac{f(ru)}{f(u)} - 1\right| < \varepsilon$.


Recall the Poisson law $P_c$ on $\mathbb{N}$ with parameter $c \ge 0$, so that $P_c(k) :=
e^{-c} c^k / k!$ for $k = 0, 1, \ldots$. Given a probability space $(X, \mathcal{A}, P)$, let $U_c$ be
a Poisson point process on $(X, \mathcal{A})$ with intensity measure $cP$. That is, for
any disjoint $A_1, \ldots, A_m$ in $\mathcal{A}$, $U_c(A_j)$ are independent random variables,
$j = 1, \ldots, m$, and for any $A \in \mathcal{A}$, $U_c(A)(\cdot)$ has law $P_{cP(A)}$.

Let $Y_c(A) := (U_c - cP)(A)$, $A \in \mathcal{A}$. Then $Y_c$ has mean 0 on all $A$ and still
has independent values on disjoint sets.

Let $x(1), x(2), \ldots$ be coordinates for the product space $(X^\infty, \mathcal{A}^\infty, P^\infty)$. For
$c > 0$ let $n(c)$ be a random variable with law $P_c$, independent of the $x(i)$. Then
for $P_n := n^{-1}(\delta_{x(1)} + \cdots + \delta_{x(n)})$, $n \ge 1$, $P_0 := 0$ we have:


Lemma 11.6 The process $Z_c := n(c) P_{n(c)}$ is a Poisson process with intensity
measure $cP$.


Proof. We have laws $\mathcal{L}(U_c(X)) = \mathcal{L}(Z_c(X)) = P_c$. If $X_i$ are independent
Poisson variables with $\mathcal{L}(X_i) = P_{c(i)}$ and $\sum_{i=1}^{m} c(i) = c$, then given $n := \sum_{i=1}^{m} X_i$,
the conditional distribution of $\{X_i\}_{i=1}^{m}$ is multinomial with total $n$ and
probabilities $p_i = c(i)/c$. Thus $U_c$ and $Z_c$ have the same conditional distributions
on disjoint sets given their values on $X$. This implies the Lemma.
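
The construction in Lemma 11.6 is easy to simulate. The sketch below is an illustration under stated assumptions, not the author's method: it takes $P$ uniform on $[0, 1]$, draws $n(c)$ with a simple multiplication sampler, and checks empirically that counts on disjoint intervals have mean and variance close to $cP(A)$ and are nearly uncorrelated. All function names are hypothetical.

```python
import math
import random

def poisson(rng, c):
    """Sample a Poisson(c) variable by multiplying uniforms (fine for moderate c)."""
    limit = math.exp(-c)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def poissonized_counts(rng, c, cuts):
    """Draw n(c) ~ Poisson(c), then n(c) i.i.d. uniform points on [0, 1);
    return n(c) and the counts in the disjoint intervals [cuts[i], cuts[i+1])."""
    n = poisson(rng, c)
    counts = [0] * (len(cuts) - 1)
    for _ in range(n):
        x = rng.random()
        for i in range(len(counts)):
            if cuts[i] <= x < cuts[i + 1]:
                counts[i] += 1
                break
    return n, counts

def simulate(reps, c, cuts, seed=0):
    """Empirical means, variances, and the covariance of the first two counts."""
    rng = random.Random(seed)
    k = len(cuts) - 1
    sums, squares, cross01 = [0.0] * k, [0.0] * k, 0.0
    for _ in range(reps):
        _, counts = poissonized_counts(rng, c, cuts)
        for i, v in enumerate(counts):
            sums[i] += v
            squares[i] += v * v
        cross01 += counts[0] * counts[1]
    means = [s / reps for s in sums]
    variances = [q / reps - m * m for q, m in zip(squares, means)]
    return means, variances, cross01 / reps - means[0] * means[1]

# Intervals of P-measure 0.3, 0.5, 0.2 with c = 5: intensities 1.5, 2.5, 1.0.
means, variances, cov01 = simulate(20000, 5.0, [0.0, 0.3, 0.8, 1.0], seed=1)
print(means, variances, cov01)
```

For a fixed sample size the counts would be multinomial and negatively correlated; randomizing the sample size with a Poisson law is what removes the dependence.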


From here on, the version $U_c \equiv Z_c$ will be used. Thus for each $\omega$, $U_c(\cdot)(\omega)$ is
a countably additive integer-valued measure of total mass $U_c(X)(\omega) = n(c)(\omega)$.
Then

$$Y_c = n(c) P_{n(c)} - cP = n(c)(P_{n(c)} - P) + (n(c) - c)P, \tag{11.3}$$
$$Y_c / c^{1/2} = (n(c)/c)^{1/2}\, \nu_{n(c)} + (n(c) - c)\, c^{-1/2} P.$$


The following shows that the empirical process $\nu_n$ is asymptotically "as
large" as a corresponding Poisson process.

Lemma 11.7 Let $(X, \mathcal{A}, P)$ be a probability space and $\mathcal{C} \subset \mathcal{A}$. Assume that
for each $n$ and constant $t$, $\sup_{A \in \mathcal{C}} |(P_n - tP)(A)|$ is measurable. Let $f$ be a
continuous, slowly varying function such that as $x \to +\infty$, $f(x) \to +\infty$. For
$b > 0$ let

$$g(b) := \liminf_{x \to +\infty} \Pr\Big\{\sup_{A \in \mathcal{C}} |Y_x(A)| \ge b f(x) x^{1/2}\Big\}.$$

Then for any $a < b$,

$$\liminf_{n \to \infty} \Pr\Big\{\sup_{A \in \mathcal{C}} |\nu_n(A)| \ge a f(n)\Big\} \ge g(b).$$

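Proof. It follows from Lemma 11.5 that $f(x)/x \to 0$ as $x \to \infty$. From (11.3),
$\sup_{A \in \mathcal{C}} |Y_x(A)|$ is measurable. As $x \to +\infty$, $\Pr(n(x) > 0) \to 1$ and $n(x)/x \to
1$ in probability. If the Lemma is false, there is a $\Theta < g(b)$ and a sequence
$m_k \to +\infty$ with, for each $m = m_k$,

$$\Pr\Big\{\sup_{A \in \mathcal{C}} |\nu_m(A)| \ge a f(m)\Big\} < \Theta.$$

Choose $0 < \varepsilon < 1/3$ such that $a(1 + 7\varepsilon) < b$. Then let $0 < \delta < 1/2$ be such
that $\delta < \delta(\varepsilon)$ in Lemma 11.5 and $(1 + \delta)(1 + 5\varepsilon) < 1 + 6\varepsilon$. We may assume
that for all $k = 1, 2, \ldots$, $m = m_k \ge 2/\delta$ and $1 + 2\varepsilon \le a\varepsilon f(m)^{1/2}$.

Set $\delta_m := (f(m)/m)^{1/2}$. Then since $f(x)/x \to 0$ we may assume $\delta_m <
\delta/2$ for all $m = m_k$. Then for any $m = m_k$, if $(1 - \delta_m)m \le n \le m$, then for
all $A \in \mathcal{A}$, $mP_m(A) \ge nP_n(A)$, so $m(P_m - P)(A) \ge n(P_n - P)(A) - m\delta_m$.
Conversely $nP_n(A) \ge mP_m(A) - m\delta_m$ and

$$n(P_n - P)(A) \ge m(P_m - P)(A) - m\delta_m.$$

Thus

$$|\nu_m(A)| \ge (n/m)^{1/2} |\nu_n(A)| - f(m)^{1/2} \ge (1 + \delta)^{-1} |\nu_n(A)| - f(m)^{1/2},$$

so $|\nu_n(A)| \le (1 + \delta)(|\nu_m(A)| + f(m)^{1/2})$. Next, $|1 - \frac{m}{n}| = \frac{m}{n} - 1 \le 2\delta_m < \delta$
implies $\left|\frac{f(m)}{f(n)} - 1\right| < \varepsilon$, so $1/f(n) \le (1 + \varepsilon)/f(m)$. Hence

$$|\nu_n(A)|/f(n) \le (1 + 2\varepsilon)\left(|\nu_m(A)|/f(m) + f(m)^{-1/2}\right).$$

Thus since $(1 + 2\varepsilon) f(m)^{-1/2} \le a\varepsilon$, $\Pr\{\sup_{A \in \mathcal{C}} |\nu_n(A)| \ge a f(n)(1 + 3\varepsilon)\} <
\Theta$.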
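For each $m = m_k$, set $c = c_m = (1 - \frac{1}{2}\delta_m)m$. Then as $k \to \infty$, since
$f(m_k) \to \infty$, by Chebyshev's inequality and since $P_c$ has variance $c$,

$$\Pr\{(1 - \delta_m)m \le n(c) \le m\} \to 1.$$

Then for any $\gamma$ with $\Theta < \gamma < g := g(b)$ and $k$ large enough, since the $x(i)$
are independent of $n(c)$,

$$\Pr\Big\{\sup_{A \in \mathcal{C}} |\nu_{n(c)}(A)| \ge a f(n(c))(1 + 3\varepsilon)\Big\} < \gamma.$$

Since $\delta_m < \delta(\varepsilon)/2$, for $(1 - \delta_m)m \le n \le m$ we have $\left|1 - \frac{f(n)}{f(c)}\right| < \varepsilon$. Thus for
$k$ large enough we may assume

$$\Pr\Big\{\sup_{A \in \mathcal{C}} |\nu_{n(c)}(A)| \ge a f(c)(1 + 5\varepsilon)\Big\} < \gamma.$$

For $k$ large enough, applying Chebyshev's inequality to $n(c) - c$, we have since
$c \to \infty$

$$\Pr\left\{\left(\frac{n(c)}{c}\right)^{1/2} > \frac{1 + 6\varepsilon}{1 + 5\varepsilon}\right\} \le \Pr\left\{\frac{n(c) - c}{c^{1/2}} > \frac{2\varepsilon c^{1/2}}{1 + 5\varepsilon}\right\} < \frac{g - \gamma}{4},$$

and since $f(c) \to \infty$, $\Pr\{|n(c) - c|/c^{1/2} \ge a f(c)\varepsilon\} < (g - \gamma)/4$. Thus by
(11.3),

$$\Pr\Big\{\sup_{A \in \mathcal{C}} |Y_c(A)| \ge a c^{1/2} f(c)(1 + 7\varepsilon)\Big\} < (\gamma + g)/2 < g,$$

a contradiction, proving Lemma 11.7.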


Next, the Poisson process's independence property on disjoint sets will
be extended to suitable random sets. Let $(X, \mathcal{A})$ be a measurable space, and
$(\Omega, \mathcal{B}, \Pr)$ a probability space. A collection $\{\mathcal{B}_A : A \in \mathcal{A}\}$ of sub-σ-algebras
of $\mathcal{B}$ will be called a filtration if $\mathcal{B}_A \subset \mathcal{B}_B$ whenever $A \subset B$ in $\mathcal{A}$. A stochastic
process $Y$ indexed by $\mathcal{A}$, $(A, \omega) \mapsto Y(A)(\omega)$, will be called adapted to $\{\mathcal{B}_A :
A \in \mathcal{A}\}$ if for every $A \in \mathcal{A}$, $Y(A)(\cdot)$ is $\mathcal{B}_A$ measurable. Then the process and
filtration will be written $\{Y(A), \mathcal{B}_A\}_{A \in \mathcal{A}}$. A stochastic process $Y : (A, \omega) \mapsto
Y(A)(\omega)$, $A \in \mathcal{A}$, $\omega \in \Omega$, will be said to have independent pieces iff for any
disjoint $A_1, \ldots, A_m \in \mathcal{A}$, $Y(A_j)$ are independent, $j = 1, \ldots, m$, and $Y(A_1 \cup
A_2) = Y(A_1) + Y(A_2)$ almost surely. Clearly each $Y_c$ has independent pieces.
If in addition the process is adapted to a filtration $\{\mathcal{B}_A : A \in \mathcal{A}\}$, the process
$\{Y(A), \mathcal{B}_A\}_{A \in \mathcal{A}}$ will be said to have independent pieces iff for any disjoint sets
$A_1, \ldots, A_n$ in $\mathcal{A}$, the random variables $Y(A_2), \ldots, Y(A_n)$ and any random
variable measurable for the σ-algebra $\mathcal{B}_{A_1}$ are jointly independent.

For example, for any $C \in \mathcal{A}$ let $\mathcal{B}_C$ be the smallest σ-algebra for which
every $Y(A)(\cdot)$ is measurable for $A \subset C$, $A \in \mathcal{A}$. This is clearly a filtration, and
the smallest filtration to which $Y$ is adapted.

A function $G$ from $\Omega$ into $\mathcal{A}$ will be called a stopping set for a filtration
$\{\mathcal{B}_A : A \in \mathcal{A}\}$ iff for all $C \in \mathcal{A}$, $\{\omega : G(\omega) \subset C\} \in \mathcal{B}_C$. Given a stopping
set $G(\cdot)$, let $\mathcal{B}_G$ be the σ-algebra of all sets $B \in \mathcal{B}$ such that for every $C \in
\mathcal{A}$, $B \cap \{G \subset C\} \in \mathcal{B}_C$. (Note that if $G$ is not a stopping set, then $\Omega \notin \mathcal{B}_G$, so
$\mathcal{B}_G$ would not be a σ-algebra.) If $G(\omega) \equiv H \in \mathcal{A}$, then it is easy to check that
$G$ is a stopping set and $\mathcal{B}_G = \mathcal{B}_H$.
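
The check for a constant set $G \equiv H$ can be written out as follows. This is a short sketch using only the definitions above of a filtration, a stopping set, and $\mathcal{B}_G$.

```latex
% Claim: if G(\omega) \equiv H for a fixed H \in \mathcal{A}, then G is a
% stopping set and \mathcal{B}_G = \mathcal{B}_H.
\[
\{\omega : G(\omega) \subset C\} =
\begin{cases}
\Omega, & H \subset C,\\
\emptyset, & H \not\subset C,
\end{cases}
\qquad \text{so } \{G \subset C\} \in \mathcal{B}_C \text{ for every } C \in \mathcal{A}.
\]
% Hence G is a stopping set. For B \in \mathcal{B}:
\[
B \in \mathcal{B}_G
\iff B \cap \{G \subset C\} \in \mathcal{B}_C \ \text{for all } C \in \mathcal{A}
\iff B \in \mathcal{B}_C \ \text{for all } C \supset H ,
\]
% because B \cap \emptyset = \emptyset lies in every \mathcal{B}_C.
% Taking C = H gives B \in \mathcal{B}_H; conversely, if B \in \mathcal{B}_H then
% B \in \mathcal{B}_C for all C \supset H since \mathcal{B}_H \subset \mathcal{B}_C.
\[
\text{Therefore } \mathcal{B}_G = \mathcal{B}_H .
\]
```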


Lemma 11.8 Suppose $\{Y(A), \mathcal{B}_A\}_{A \in \mathcal{A}}$ has independent pieces and for all $\omega \in
\Omega$, $G(\omega) \in \mathcal{A}$, $A(\omega) \in \mathcal{A}$ and $E(\omega) \in \mathcal{A}$.
Assume that:

(i) $G(\cdot)$ is a stopping set;
(ii) For all $\omega$, $G(\omega)$ is disjoint from $A(\omega)$ and from $E(\omega)$;
(iii) Each of $G(\omega)$, $A(\omega)$ and $E(\omega)$ has just countably many possible values
$G(j) := G_j \in \mathcal{A}$, $C(i) := C_i \in \mathcal{A}$, and $D(j) := D_j \in \mathcal{A}$ respectively;
(iv) For all $i, j$, $\{A(\cdot) = C_i\} \in \mathcal{B}_G$ and $\{E(\cdot) = D_j\} \in \mathcal{B}_G$.

Then the conditional probability law (joint distribution) of $Y(A)$ and $Y(E)$
given $\mathcal{B}_G$ satisfies

$$\mathcal{L}\{(Y(A), Y(E)) \mid \mathcal{B}_G\} = \sum_{i,j} 1_{\{A(\cdot) = C(i),\, E(\cdot) = D(j)\}}\, \mathcal{L}(Y(C_i), Y(D_j))$$

where $\mathcal{L}(Y(C_i), Y(D_j))$ is the unconditional joint distribution of $Y(C_i)$ and
$Y(D_j)$. If this unconditional distribution is the same for all $i$ and $j$, then
$(Y(A), Y(E))$ is independent of $\mathcal{B}_G$.


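Proof. The proof will be given when there is only one random set $A(\cdot)$ rather
than two, $A(\cdot)$ and $E(\cdot)$. The proof for two is essentially the same. If for some
$\omega$, $A(\omega) = C_i$ and $G(\omega) = G_j$, then by (ii), $C_i \cap G_j = \emptyset$ (the empty set).
Thus $Y(C_i)$ is independent of $\mathcal{B}_{G(j)}$. Let $B_i := B(i) := \{A(\cdot) = C_i\} \in \mathcal{B}_G$ by
(iv). For each $j$,

$$\{G = G_j\} = \{G \subset G_j\} \setminus \bigcup \big\{ \{G \subset G_i\} : G_i \subset G_j,\ G_i \ne G_j \big\},$$

so by (i)

$$\{G = G_j\} \in \mathcal{B}_{G(j)}. \tag{11.4}$$

Let $H_j := \{G = G_j\}$. For any $D \in \mathcal{A}$, $H_j \cap \{G \subset D\} = \emptyset \in \mathcal{B}_D$ if $G_j \not\subset D$.
If $G_j \subset D$, then $H_j \cap \{G \subset D\} = H_j \in \mathcal{B}_{G(j)} \subset \mathcal{B}_D$ by (11.4). Thus

$$H(j) := H_j \in \mathcal{B}_G. \tag{11.5}$$

For any $B \in \mathcal{B}_G$, by (11.4)

$$B \cap H_j = B \cap \{G \subset G_j\} \cap H_j \in \mathcal{B}_{G(j)}. \tag{11.6}$$

We have for any real $t$ almost surely, since $B_i \in \mathcal{B}_G$,

$$\Pr(Y(A) \le t \mid \mathcal{B}_G) = \sum_{i,j} \Pr(Y(A) \le t \mid \mathcal{B}_G)\, 1_{B(i)} 1_{H(j)}
= \sum_{i,j} \Pr(Y(C_i) \le t \mid \mathcal{B}_G)\, 1_{B(i)} 1_{H(j)}.$$

The sums can be restricted to $i, j$ such that $B_i \cap H_j \ne \emptyset$ and therefore $C_i \cap
G_j = \emptyset$. Now $\Pr(Y(C_i) \le t \mid \mathcal{B}_G)$ is a $\mathcal{B}_G$-measurable function $f_i$ such that for
any $\Gamma \in \mathcal{B}_G$,

$$\Pr(\{Y(C_i) \le t\} \cap \Gamma) = \int_{\Gamma} f_i \, d\Pr.$$

Restricted to $H_j$, $f_i$ equals a $\mathcal{B}_{G(j)}$-measurable function by (11.6). By (11.5),
$\Gamma \cap H_j \in \mathcal{B}_G$, so

$$\int_{\Gamma \cap H(j)} f_i \, d\Pr = \Pr(\{Y(C_i) \le t\} \cap \Gamma \cap H_j) = \Pr(Y(C_i) \le t)\, \Pr(\Gamma \cap H_j) \tag{11.7}$$

because by (11.6) $\Gamma \cap H(j) \in \mathcal{B}_{G(j)}$, which is independent of $\{Y(C_i) \le t\}$.

One solution $f_i$ to (11.7) for all $j$ is $f_i \equiv \Pr(Y(C_i) \le t)$. To show that
this solution is unique, it will be enough to show that it is unique on $H_j$ for
each $j$. First, for a set $A \subset H_j$ it will be shown that $A \in \mathcal{B}_{G(j)}$ if and only
if $A \in \mathcal{B}_G$. By (11.4) and (11.5) $H_j$ itself is in both $\mathcal{B}_G$ and $\mathcal{B}_{G(j)}$. We can
write $A = A \cap H(j)$. Then "if" follows from (11.6). To prove "only if" let
$A \in \mathcal{B}_{G(j)}$. Then for $C \in \mathcal{A}$, let

$$F := A \cap \{G \subset C\} = A \cap \{G_j \subset C\}.$$

If $G_j \subset C$, then $F = A \in \mathcal{B}_{G(j)} \subset \mathcal{B}_C$. Otherwise, $F$ is empty and in $\mathcal{B}_C$. In
either case $F \in \mathcal{B}_C$ so $A \in \mathcal{B}_G$ as desired.

Thus, if $g$ is a function on $H_j$, measurable for $\mathcal{B}_{G(j)}$ there, and $\int_{D \cap H(j)}
g \, d\Pr = 0$ for all $D \in \mathcal{B}_G$, then the same holds for all $D \in \mathcal{B}_{G(j)}$. So, as in the
usual proof of uniqueness of conditional expectations,

$$\int_{\{g > 0\} \cap H(j)} g \, d\Pr = \int_{\{g < 0\} \cap H(j)} g \, d\Pr = 0,$$

so $g = 0$ on $H(j)$ a.s. and $f_i = \Pr(Y(C_i) \le t)$ a.s. on $H(j)$ for all $j$. Thus
$\Pr(Y(A) \le t \mid \mathcal{B}_G) = \sum_i \Pr(Y(C_i) \le t)\, 1_{B(i)}$.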


Here is another fact about stopping sets, which corresponds to a known 
fact about nonnegative real-valued stopping times or Markov times (e.g., RAP, 
Lemma 12.2.5): 


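Lemma 11.9 If $G$ and $H$ are stopping sets and $G \subset H$, then $\mathcal{B}_G \subset \mathcal{B}_H$.


Proof. For any measurable set $D$ and $A \in \mathcal{B}_G$, we have

$$A \cap \{H \subset D\} = (A \cap \{G \subset D\}) \cap \{H \subset D\} \in \mathcal{B}_D.$$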


11.4 Lower Bounds in Borderline Cases 


Recall the classes $\mathcal{C}(\alpha, K, d)$ of subgraphs of functions with bounded
derivatives through order $\alpha$ in $\mathbb{R}^d$, defined in Section 8.2. We had lower bounds
for $P_n - P$ on $\mathcal{C}(\alpha, K, d)$ in Theorem 11.2, which imply that for $\alpha < d - 1$,
$\|\nu_n\|_{\mathcal{C}(\alpha,K,d)} \to \infty$ surely as $n \to \infty$. For $\alpha > d - 1$, $\mathcal{C}(\alpha, K, d)$ is a Donsker
class by Theorem 8.4 and Corollary 7.8, so $\|\nu_n\|_{\mathcal{C}(\alpha,K,d)}$ is bounded in
probability. Thus $\alpha = d - 1$ is a borderline case. Other such cases are given by the
class $\mathcal{LL}_2$ of lower layers in $\mathbb{R}^2$ (Section 8.3) and the class $\mathcal{C}_3$ of convex sets
in $\mathbb{R}^3$ (Section 8.4), for $\lambda^d$ = Lebesgue measure on the unit cube $I^d$ where
$I := [0, 1]$.

Any lower layer $A$ has a closure $\bar{A}$ which is also a lower layer, with $\lambda^d(\bar{A} \setminus
A) = 0$, where in the present case $d = 2$. It is easily seen that suprema of our
processes over all lower layers are equal to suprema over closed lower layers,
so it will be enough to consider closed lower layers. Let $\overline{\mathcal{LL}}_2$ be the class of all
closed lower layers in $\mathbb{R}^2$.

Let $P = \lambda^d$ and $c > 0$. Recall the centered Poisson process $Y_c$ from Section
11.3. Let $N_c := U_c - V_c$ where $U_c$ and $V_c$ are independent Poisson processes,
each with intensity measure $cP$. Equivalently, we could take $U_c$ and $V_c$ to be
centered. The following lower bound holds for all the above borderline cases:


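Theorem 11.10 For any $K > 0$ and $\delta > 0$ there is a $\gamma = \gamma(d, K, \delta) > 0$ such
that

$$\lim_{x \to +\infty} \Pr\left\{\|Y_x\|_{\mathcal{C}} > \gamma (x \log x)^{1/2} (\log\log x)^{-\delta - 1/2}\right\} = 1$$

and

$$\lim_{n \to \infty} \Pr\left\{\|\nu_n\|_{\mathcal{C}} > \gamma (\log n)^{1/2} (\log\log n)^{-\delta - 1/2}\right\} = 1$$

where $\mathcal{C} = \mathcal{C}(d - 1, K, d)$, $d \ge 2$, or $\mathcal{C} = \mathcal{LL}_2$, or $\mathcal{C} = \mathcal{C}_3$.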


For a proof, see the next section, except that here I will give a larger lower
bound with probability close to 1, of order $(\log n)^{3/4}$ in the lower layer case
($\mathcal{C} = \mathcal{LL}_2$, $d = 2$). Shor (1986) first showed that $E\|Y_x\|_{\mathcal{C}} > \gamma x^{1/2} (\log x)^{3/4}$ for
some $\gamma > 0$ and $x$ large enough. Shor's lower bound also applies to $\mathcal{C}(1, K, 2)$
by a 45° rotation as in Section 8.3. For an upper bound with a 3/4 power of the
log also for convex subsets of a fixed bounded open set in $\mathbb{R}^3$, see Talagrand
(1994, Theorem 1.6).

To see that the supremum of $N_c$, $Y_c$ or an empirical process $\nu_n$ over $\overline{\mathcal{LL}}_2$ is
measurable, note first for $P_n$ that for each $F \subset \{1, \ldots, n\}$ and each $\omega$, there is a
smallest, closed lower layer $L_F(\omega)$ containing the $x_j$ for $j \in F$, with $L_F(\omega) :=
\emptyset$ for $F = \emptyset$. For any $c > 0$, $\omega \mapsto (P_n - cP)(L_F(\omega))(\omega)$ is measurable. The
supremum of $P_n - cP$ over $\overline{\mathcal{LL}}_2$, as the maximum of these $2^n$ measurable
functions, is measurable. Letting $n = n(c)$ as in Lemma 11.6 and (11.3) then
shows $\sup\{Y_c(A) : A \in \overline{\mathcal{LL}}_2\}$ is measurable. Likewise, there is a largest, open
lower layer not containing $x_j$ for any $j \in F$, so $\sup\{|Y_c(A)| : A \in \mathcal{LL}_2\}$ and
$\sup\{|\nu_n(A)| : A \in \mathcal{LL}_2\}$ are measurable.

For $N_c$, taking noncentered Poisson processes $U_c$ and $V_c$, their numbers
of points $m(\omega)$ and $n(\omega)$ are measurable, as are the $m$-tuple and $n$-tuple of
points occurring in each. For each $i = 0, 1, \ldots, m$ and $k = 0, 1, \ldots, n$, it is
a measurable event that there exists a lower layer containing exactly $i$ of the
$m$ points and $k$ of the $n$, and so the supremum of $N_c$ over all lower layers,
as a measurable function of the indicators of these finitely many events, is
measurable.
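
The reduction of the supremum of $P_n - cP$ over closed lower layers to a maximum over the $2^n$ layers $L_F$ can be made concrete for small $n$. The brute-force sketch below assumes $P$ is uniform on the unit square and that all sample points lie in it; the function names are hypothetical and the enumeration is exponential in $n$, so it is meant only as an illustration.

```python
from itertools import combinations

def layer_area(gen):
    """Area, inside the unit square, of the smallest closed lower layer that
    contains the points in gen: the union of the rectangles [0,x] x [0,y]."""
    pts = sorted(gen, reverse=True)           # sweep from the largest x down
    area, height = 0.0, 0.0
    for idx, (x, y) in enumerate(pts):
        height = max(height, y)               # staircase height left of x
        next_x = pts[idx + 1][0] if idx + 1 < len(pts) else 0.0
        area += (x - next_x) * height
    return area

def in_layer(p, gen):
    """True if p lies in the smallest closed lower layer containing gen."""
    return any(p[0] <= q[0] and p[1] <= q[1] for q in gen)

def sup_over_lower_layers(sample, c):
    """Maximum of (P_n - c P)(L_F) over all subsets F of the sample."""
    n = len(sample)
    best = 0.0                                # F empty gives the empty layer
    for r in range(1, n + 1):
        for gen in combinations(sample, r):
            pn = sum(in_layer(p, gen) for p in sample) / n
            best = max(best, pn - c * layer_area(gen))
    return best

print(sup_over_lower_layers([(0.5, 0.5)], 1.0))
```

For any closed lower layer $A$, the layer $L_F$ generated by the sample points in $A$ has the same empirical mass and no larger area, which is why no other layers need to be examined.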


Theorem 11.11 (Shor) For every $\varepsilon > 0$ there is a $\delta > 0$ such that for the
uniform distribution $P$ on the unit square $I^2$, and $n$ large enough,

$$\Pr\left(\sup\{|\nu_n(A)| : A \in \mathcal{LL}_2\} \ge \delta (\log n)^{3/4}\right) \ge 1 - \varepsilon,$$

and the same holds for $\mathcal{C}(1, 2, 2)$ in place of $\mathcal{LL}_2$. Also, $\nu_n$ can be replaced by
$N_c/c^{1/2}$ or $Y_c/c^{1/2}$ if $\log n$ is replaced by $\log c$, for $c$ large enough.


Remark. The order $(\log n)^{3/4}$ of the lower bound is best possible, as there is an
upper bound in expectation of the same order, not proved here; see Rhee and
Talagrand (1988), Leighton and Shor (1989), and Coffman and Shor (1991).


Proof. Let $M_1$ be the set of all functions $f$ on $[0, 1]$ with $f(0) = f(1) = 1/2$
and $\|f\|_L \le 1$, i.e., $|f(x) - f(u)| \le |x - u|$ for $0 \le x \le u \le 1$. Then $0 \le
f(x) \le 1$ for $0 \le x \le 1$. For any $f : [0, 1] \to [0, \infty)$ let $S_f$ be the subgraph
of $f$, $S_f := S(f) := \{(x, y) : 0 \le x \le 1,\ 0 \le y \le f(x)\}$. Then for each
$f \in M_1$ we have $S_f \in \mathcal{C}(1, 2, 2)$. Let $\mathcal{S}_1 := \{S_f : f \in M_1\}$.

Let $R$ be a counterclockwise rotation of $\mathbb{R}^2$ by 45°, $R = 2^{-1/2}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$. Then
for each $f \in M_1$, $R^{-1}(S_f) = M \cap R^{-1}(I^2)$ where $M$ is a lower layer, $I^2 :=
I \times I$, and $I := [0, 1]$. So, it will be enough to prove the theorem for $\mathcal{S}_1 \subset
\mathcal{C}(1, 2, 2)$.

Let $\lfloor x \rfloor$ denote the largest integer $\le x$ and let $0 < \delta \le 1/3$. For each $c$
large enough so that $\delta^2 (\log c) \ge 1$ and $\omega$, functions $f_0 \le f_1 \le \cdots \le f_L \le g_L
\le \cdots \le g_1 \le g_0$ will be defined for $L := \lfloor \delta^2 (\log c) \rfloor$, with $f_i := f_i(t) :=
f_i(\omega, t)$ and likewise $g_i$ for $\omega \in \Omega$ and $0 \le t \le 1$.

Each $f_i$ and $g_i$ will be continuous and piecewise linear in $t$. For each
$j = 1, \ldots, 2^i$, one of $f_i$ and $g_i$ will be linear on $J_{ij} := [(j - 1)/2^i, j/2^i]$, and
the other will be linear only on each half, $J_{i+1,2j-1}$ and $J_{i+1,2j}$, with $f_i = g_i$ at
the endpoints of $J_{ij}$. Thus over $J_{ij}$, the region $T_{ij}$ between the graphs of $f_i$ and
$g_i$ is a triangle.

Let $f_0(x) := 1/2$ for $0 \le x \le 1$. Let $s := 1/(L + 1)$ and $g_0(1/2) :=
(1 + s)/2$. Given $f_i$ and $g_i$, to define $f_{i+1}$ and $g_{i+1}$, we have two cases.


Case 1. Suppose $g_i$ is linear on $J_{ij}$, as in Figure 11.1. Let $p_k := (x_k, y_k)$ be the
point labeled by $k = 1, \ldots, 10$ in Figure 11.1. Then $x_k = (4j + k - 5)/2^{i+2}$
for $k = 1, \ldots, 5$, $x_k = x_{10-k}$ for $k = 6, 7, 8$, $x_9 = x_2 + 1/(3 \cdot 2^{i+2})$ and $x_{10} =
x_4 - 1/(3 \cdot 2^{i+2})$. Thus $T := T_{ij}$ is the triangle $p_1 p_3 p_5$. Let $Q := Q_{ij}$ be the
quadrilateral $p_3 p_{10} p_7 p_9$, and $V := V_{ij}$ the union $p_2 p_3 p_9 \cup p_3 p_4 p_{10}$ of two
triangles. The triangles $p_1 p_2 p_7$, $p_1 p_3 p_8$, $p_3 p_5 p_6$, and $p_4 p_5 p_7$ each have 1/4
the area of $T$, $Q$ has 1/3, and $V$ has 1/6.

Figure 11.1

Let $\rho_{ij}$, $i = 0, 1, 2, \ldots$, $j = 1, \ldots, 2^i$, be i.i.d. Rademacher random
variables, so that $P(\rho_{ij} = 1) = P(\rho_{ij} = -1) = 1/2$, independent of the $N_c$
process. If $N_c(Q_{ij}) > 0$, or if $N_c(Q_{ij}) = 0$ and $\rho_{ij} = 1$, say event $A_{ij}$ occurs and
set $g_{i+1} := g_i$ on $J_{ij}$. Then, the graph of $f_{i+1}$ on $J_{ij}$ will consist of line
segments joining the points $p_1, p_2, p_7, p_4, p_5$ in turn, so $T_{i+1,2j-1} = p_1 p_2 p_7$
and $T_{i+1,2j} = p_7 p_4 p_5$.

If $A_{ij}$ does not occur, i.e., $N_c(Q) < 0$ or $N_c(Q) = 0$ and $\rho_{ij} = -1$, set
$f_{i+1} := f_i$ on $J_{ij}$ and let the graph of $g_{i+1}$ consist of line segments joining the
points $p_1, p_8, p_3, p_6, p_5$, so that $T_{i+1,2j-1} = p_1 p_8 p_3$ and $T_{i+1,2j} = p_3 p_6 p_5$.
This finishes the recursive definition of $f_i$ and $g_i$ in Case 1. The stated properties
of $f_i$ and $g_i$ continue to hold.


Case 2. In this case $f_i$ is linear on $J_{ij}$, see Figure 11.2. Then $T_{ij}$ is the triangle
$p_1 p_5 p_7$. The other definitions remain the same as in Case 1.

Figure 11.2


Lemma 11.12 For $i = 0, 1, \ldots, L$, all slopes of segments of $f_i$ and $g_i$ have
absolute values at most $(i + 1)s \le (L + 1)s = 1$, so $f_i$ and $g_i$ are in $\mathcal{G}(1, 2, 1)$.


Proof. For $i = 0$, the slope for $f_0$ is 0 and those for $g_0$ are $\pm s$. By induction
for $i \ge 1$, the slopes for $f_{i-1}$ are those for $p_1 p_3$ and/or $p_3 p_5$, and those for
$g_{i-1}$ are those of $p_1 p_7$ and/or $p_7 p_5$ and by induction assumption are at most
$is$ in absolute value. There is an integer $m = m_{ij} = m_{ij}(\omega) \in \mathbb{Z}$ with $|m| \le i$
such that $p_1 p_5$ has slope $ms$, $p_1 p_3$ in Figure 11.1 and $p_7 p_5$ in Figure 11.2
have slope $(m - 1)s$, while $p_3 p_5$ in Figure 11.1 and $p_1 p_7$ in Figure 11.2 have
slope $(m + 1)s$. We have for each $i$ and $j$ that $m_{i+1,2j-1} - m_{ij}$ and $m_{i+1,2j} -
m_{ij}$ have possible values 0 or $\pm 1$. Specifically, on $A_{ij}$ in Case 1 or $A_{ij}^c$ in
Case 2, $m_{i+1,2j} = m_{i+1,2j-1} = m_{ij}$. On $A_{ij}^c$ in Case 1, $m_{i+1,2j-1} - m_{ij} = -1$
and $m_{i+1,2j} - m_{ij} = 1$. Or on $A_{ij}$ in Case 2, $m_{i+1,2j-1} - m_{ij} = 1$ and $m_{i+1,2j} -
m_{ij} = -1$. Thus in each case, from $i - 1$ to $i$, the maximum absolute value of
the slope of a segment increases at most by $s$, and the conclusions follow.


Let $(\Omega, \mu)$ be a probability space on which $N_c$ and all $\rho_{ij}$ are defined. Take
$\Omega_1 := \Omega \times [0, 1]$ with product measure $\Pr := \mu \times \lambda$ where $\lambda$ is the uniform
(Lebesgue) law on $[0, 1]$. For almost all $t \in [0, 1]$, $t$ is not a dyadic rational, so
for each $i = 1, 2, \ldots$, there is a unique $j := j(i, t)$, $j = 1, \ldots, 2^i$, such that
$t \in \mathrm{Int}_{ij} := ((j - 1)/2^i, j/2^i)$. Thus $\mathrm{Int}_{ij}$ is the interior of $J_{ij}$. The derivatives
$f_i'$ and $g_i'$ are step functions, constant on each interval $\mathrm{Int}_{i+1,j}$, and possibly
undefined at the endpoints.

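Some random subsets of $I^2$ are defined as follows. For $m = 0, 1, 2, \ldots$, let

$$G(m) := G_m := \bigcup_{k=0}^{m} \bigcup_{i=1}^{2^k} Q_{ki}.$$

For $j = 1, \ldots, 2^{m+1}$, let $G(m + 1, j) := G(m) \cup \bigcup_{i=1}^{j} Q_{m+1,i}$.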

For each $C$ in the Borel σ-algebra $\mathcal{B} := \mathcal{B}(I^2)$, $k = 0, 1, \ldots$, and $i =
1, \ldots, 2^k$, let $\mathcal{B}_C^{(k,i)}$ be the smallest σ-algebra of subsets of $\Omega$ with respect to
which all $\rho_{mr}$ for $m = 0, 1, \ldots, k - 1$ or $m = k$, $r = 1, \ldots, i$, and all $N_c(A)$,
for Borel $A \subset C$, are measurable. It is easily seen that for each fixed $k$ and
$i$, $\{\mathcal{B}_C^{(k,i)} : C \in \mathcal{B}\}$ is a filtration, and that $\{N_c(A), \mathcal{B}_A^{(k,i)}\}_{A \in \mathcal{B}}$ has independent
pieces as defined before Lemma 11.8.

The initial triangle $T_{01}$ is fixed and has area $s/4$. For each $k = 1, 2, \ldots$
and each possible value of the triangle $T_{k-1,j}$, there are two possible values of
the pair of triangles $T_{k,2j-1}$, $T_{k,2j}$, each with area $s/4^{k+1}$. This and the area
$s/(3 \cdot 4^{k+1})$ of $Q_{kj}$ are nonrandom. For each $m = 0, 1, 2, \ldots$, there are finitely
many possibilities for the values of $T_{ki}$ and so $Q_{ki}$ for $k = 0, 1, \ldots, m$ and
$i = 1, \ldots, 2^k$, each on a measurable event, and for whether $A_{ki}$ holds or not.
So $\omega \mapsto N_c(Q_{ij}(\omega))(\omega)$ is measurable, and for each $m$ and $j$, there are finitely
many possible values of $G(m, j)(\omega)$. Next we have:


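Lemma 11.13 For each $k = 0, 1, 2, \ldots$ and $i = 1, \ldots, 2^k$,

(a) $G(k, i)$ is a stopping set for $\{\mathcal{B}_A^{(k,i)}\}_{A \in \mathcal{B}}$;

(b) $A_{ki} \in \mathcal{B}^{k,i} := \mathcal{B}^{(k,i)}_{G(k,i)}$;

(c) If $k \ge 1$, then for each possible value $Q$ of $Q_{ki}$, $\{Q_{ki} = Q\} \in \mathcal{B}^{(k-1)}_{G(k-1)}$,
where $\mathcal{B}^{(r)} := \mathcal{B}^{(r,2^r)}$ for $r = 0, 1, \ldots$.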


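Proof. It will be shown by double induction on $k$ and on $i$ for fixed $k$ that (a),
(b), and (c) all hold. Let $\mathcal{B}_A^{(k)} := \mathcal{B}_A^{(k,2^k)}$ for any $A \in \mathcal{B}(I^2)$ and $k = 0, 1, \ldots$.
For $k = 0$ we have $G(0) = G(0, 1) = Q_{01}$, a fixed set. So as noted just before
Lemma 11.8, $G(0)$ is a stopping set, and $\mathcal{B}^{(0)}_{Q_{01}} = \mathcal{B}^{(0)}_{G(0)}$ where $\mathcal{B}^{(0)}_{G(0)}$ is defined
as for stopping sets. So (a) holds for $k = 0$.

Since $1_{A_{01}}$ is a measurable function of $N_c(Q_{01})$ and $\rho_{01}$, we have $A_{01} \in \mathcal{B}^{0,1}$,
so (b) holds for $k = 0$.

In (c), each event $\{Q_{ki} = Q\}$ is a Boolean combination of events $A_{jr}$ for $j =
0, \ldots, k - 1$ and $r = 1, \ldots, 2^j$. Thus if (a) and (b) hold for $k = 0, 1, \ldots, m -
1$ for some $m \ge 1$, then (c) holds for $k = m$. So (c) holds for $k = 1$.

To prove (a) for $k = m$, let $\mathcal{J}(m, j)$ be the finite set of possible values of
$G(m, j)$. If (c) holds for $k = 1, \ldots, m$, then from the definition of $G(m, j)$, for
each $G \in \mathcal{J}(m, j)$,

$$\{G(m, j) = G\} \in \mathcal{B}^{(m-1)}_{G(m-1)}. \tag{11.8}$$

Thus, for any $D \in \mathcal{B}(I^2)$, since $G(m - 1) \subset G(m, j)$,

$$\{G(m, j) \subset D\} = \{G(m, j) \subset D\} \cap \{G(m - 1) \subset D\}
= \bigcup_{G \in \mathcal{J}(m,j),\, G \subset D} \{G(m, j) = G\} \cap \{G(m - 1) \subset D\} \in \mathcal{B}_D^{(m-1)},$$

so $G(m, j)$ is a stopping set and (a) holds for $k = m$.
Then to show (b) for $k = m$, for a set $C \in \mathcal{B}(I^2)$ and $i = 1, \ldots, 2^m$ let

$$C_{(m,i)} := \{N_c(C) > 0\} \cup \{N_c(C) = 0,\ \rho_{mi} = 1\}.$$

Then

$$C_{(m,i)} \in \mathcal{B}_C^{(m,i)}. \tag{11.9}$$

If $\mathcal{C}(m, i)$ is the finite set of possible values of $Q_{mi}$,

$$A_{mi} = \bigcup_{Q \in \mathcal{C}(m,i)} \{Q_{mi} = Q\} \cap Q_{(m,i)}.$$

Thus

$$A_{mi} \cap \{G(m, i) \subset D\}
= \{G(m, i) \subset D\} \cap \{G(m - 1) \subset D\} \cap \bigcup_{Q \in \mathcal{C}(m,i),\, Q \subset D} \{Q_{mi} = Q\} \cap Q_{(m,i)}.$$

It follows from (11.8) that $\{G(m, i) \subset D\} \in \mathcal{B}_D^{(m-1)}$ for each $m$ and $i$. By (c)
for $k = m$, each $\{Q_{mi} = Q\} \in \mathcal{B}^{(m-1)}_{G(m-1)}$. Thus

$$\{G(m, i) \subset D\} \cap \{Q_{mi} = Q\} \in \mathcal{B}^{(m-1)}_{G(m-1)},$$

and so

$$\{G(m, i) \subset D\} \cap \{Q_{mi} = Q\} \cap \{G(m - 1) \subset D\} \in \mathcal{B}_D^{(m-1)}.$$

For $Q \subset D$, by (11.9), $Q_{(m,i)} \in \mathcal{B}_D^{(m,i)}$. So taking a union of intersections,
$A_{mi} \cap \{G(m, i) \subset D\} \in \mathcal{B}_D^{(m,i)}$. Thus $A_{mi} \in \mathcal{B}^{(m,i)}_{G(m,i)}$ and (b) holds for $k = m$.
This finishes the proof of the Lemma.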


Next, here is a fact about symmetrized Poisson variables. 


Lemma 11.14 There is a constant $c_0 > 0$ such that if $X$ and $Y$ are independent
random variables, each with distribution Poisson with parameter $\lambda \ge 1$, then
$E|X - Y| \ge c_0 \lambda^{1/2}$.


Proof. By a straightforward calculation, $E((X - Y)^4) = 2\lambda + 12\lambda^2 \le 14\lambda^2$ if
$\lambda \ge 1$. By the Cauchy-Bunyakovsky-Schwarz inequality,

$$2\lambda = E((X - Y)^2) \le (E|X - Y|)^{1/2} (E(|X - Y|^3))^{1/2}$$

and $E(|X - Y|^3) \le (E(|X - Y|^4))^{3/4}$. The Lemma follows with $c_0 =
4/14^{3/4}$.
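
Both the fourth-moment identity and the resulting bound can be checked numerically. The sketch below truncates the Poisson laws at a cutoff `kmax` (an assumption that is harmless for the moderate $\lambda$ used here) and sums over the joint probability mass function; the function names are hypothetical.

```python
import math

def poisson_pmf(lam, kmax):
    """Probabilities P(X = k), k = 0..kmax, for X ~ Poisson(lam), by recursion."""
    p = [math.exp(-lam)]
    for k in range(kmax):
        p.append(p[-1] * lam / (k + 1))
    return p

def abs_moment(lam, power, kmax=120):
    """E|X - Y|**power for independent X, Y ~ Poisson(lam), truncated at kmax."""
    p = poisson_pmf(lam, kmax)
    return sum(p[i] * p[j] * abs(i - j) ** power
               for i in range(kmax + 1) for j in range(kmax + 1))

C0 = 4 / 14 ** 0.75  # the constant from the proof of Lemma 11.14

for lam in (1.0, 2.0, 5.0, 10.0):
    fourth = abs_moment(lam, 4)
    first = abs_moment(lam, 1)
    # fourth should equal 2*lam + 12*lam**2; first should exceed C0*sqrt(lam).
    print(lam, fourth, 2 * lam + 12 * lam ** 2, first, C0 * math.sqrt(lam))
```

The constant $c_0$ obtained this way is not optimal; only the order $\lambda^{1/2}$ is needed in what follows.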


Now continuing with the proof of Theorem 11.11, set $G(m, 0) := G_{m-1}$.
We apply Lemma 11.13(a) and (c). For $m = 1, 2, \ldots$ and $j = 1, \ldots, 2^m$, the
random set $Q_{mj}$ is disjoint from the stopping set $G_{m,j-1}$. The hypotheses of
Lemma 11.8 hold for $(G, A) = (G_{m,j-1}, Q_{mj})$ and the process $Y = N_c$ with
the filtration $\{\mathcal{B}_A^{(m,j-1)} : A \in \mathcal{B}\}$. Since all possible values of $Q_{mj}$ have the




same area, it follows that the random variable $N_c(Q_{mj})$ has the law of $U - V$
for $U$, $V$ i.i.d. Poisson with parameter

$$cP(Q_{mj}) = cs/(6 \cdot 4^m) \tag{11.10}$$

and is independent of $\mathcal{B}^{m,j-1}$, which was defined in Lemma 11.13(b). Also,
$N_c(Q_{ri})$ is $\mathcal{B}^{m,j-1}$ measurable for $r < m$ or for $r = m$ and $i < j$. So, the
variables $N_c(Q_{mj})$ for $m = 0, 1, 2, \ldots$ and $j = 1, \ldots, 2^m$ are all jointly
independent with the given laws. Also, the i.i.d. Rademacher variables $\rho_{mj}$ are
independent of the process $N_c$.

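For $t \in \mathrm{Int}_{ij}$, let $h_i(\omega, t)$ be the slope of the longest side of $T_{ij}$, so $h_i(\omega, t) =
g_i'(t)$ in Case 1 and $f_i'(t)$ in Case 2, while $h_0(\omega, t) \equiv 0$. In Case 1, if $A_{ij}$ holds
then $h_{i+1} = h_i$ on $\mathrm{Int}_{ij}$. If $\omega \notin A_{ij}$ then $h_{i+1} - h_i = -s$ on $\mathrm{Int}_{i+1,2j-1}$ and
$+s$ on $\mathrm{Int}_{i+1,2j}$.

In Case 2, if $\omega \notin A_{ij}$ then $h_{i+1} = h_i$ on $\mathrm{Int}_{ij}$. If $\omega \in A_{ij}$, then $h_{i+1} - h_i = s$
on $\mathrm{Int}_{i+1,2j-1}$ and $-s$ on $\mathrm{Int}_{i+1,2j}$.

Let $\zeta_i := h_i - h_{i-1}$ for $i = 1, \ldots, L$. By Lemmas 11.8 and 11.13, and
(11.10), any $N_c(Q_{mj})$ or $\rho_{mj}$ for $m \ge k$ is independent of $\mathcal{B}^{(k-1)}_{G(k-1)}$, where $\mathcal{B}^{(-1)}_{G(-1)}$
is defined as the trivial σ-algebra $\{\emptyset, \Omega_1\}$. So, each of $\zeta_1, \ldots, \zeta_k$ is $\mathcal{B}^{(k-1)}_{G(k-1)}$
measurable while any event $A_{kj}$ and the random function $\zeta_{k+1}$ are independent
of $\mathcal{B}^{(k-1)}_{G(k-1)}$. It follows that $h_j(\omega, t) = \sum_{r=1}^{j} \zeta_r(\omega, t)$ where $\zeta_r$ are i.i.d. variables
for $\Pr = \mu \times \lambda$ on $\Omega \times [0, 1]$ having distribution $\Pr(\zeta_r = 0) = 1/2$, $\Pr(\zeta_r =
s) = \Pr(\zeta_r = -s) = 1/4$. Since $\zeta_r$ are independent and symmetric, we have
by the P. Lévy inequality (Theorem 1.20) that for any $M > 0$,

$$\Pr(|h_r| \ge M \text{ for some } r \le L) \le 2\Pr(|h_L| \ge M). \tag{11.11}$$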
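
For small $L$ the two sides of the Lévy inequality (11.11) can be computed exactly for this particular step distribution. The sketch below works in units of $s$, so that steps are $-1, 0, +1$ with probabilities $1/4, 1/2, 1/4$ and the threshold is an integer `m` (playing the role of $M/s$); the function name and the parameter ranges are assumptions made for illustration.

```python
from fractions import Fraction

STEP = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}

def levy_sides(L, m):
    """Return (Pr(max_{r<=L} |h_r| >= m*s), Pr(|h_L| >= m*s)) exactly, where
    h_r is a sum of r i.i.d. steps with the law STEP (in units of s)."""
    alive = {0: Fraction(1)}   # paths that have stayed strictly inside (-m, m)
    full = {0: Fraction(1)}    # unrestricted law of h_r / s
    hit = Fraction(0)          # mass of paths that have reached |h_r| >= m
    for _ in range(L):
        new_alive, new_full = {}, {}
        for pos, pr in alive.items():
            for step, q in STEP.items():
                nxt = pos + step
                if abs(nxt) >= m:
                    hit += pr * q
                else:
                    new_alive[nxt] = new_alive.get(nxt, 0) + pr * q
        for pos, pr in full.items():
            for step, q in STEP.items():
                new_full[pos + step] = new_full.get(pos + step, 0) + pr * q
        alive, full = new_alive, new_full
    end = sum(pr for pos, pr in full.items() if abs(pos) >= m)
    return hit, end

for L in range(1, 9):
    for m in range(1, 5):
        lhs, rhs = levy_sides(L, m)
        assert rhs <= lhs <= 2 * rhs   # the maximal probability is within a factor 2
```

Exact rational arithmetic is used so that the comparison is not affected by rounding.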


Now, we can write $\zeta_j = \eta_{2j-1} + \eta_{2j}$ where $\eta_i$ are i.i.d. variables with $\Pr(\eta_i =
s/2) = \Pr(\eta_i = -s/2) = 1/2$. So by one of Hoeffding's inequalities
(Proposition 1.12 above), we have $\Pr(|h_L| \ge M) \le \exp(-M^2/(Ls^2))$. For $c$ large,
$(1 - 2s)^2 \ge 1/2$. Thus

$$\Pr(|h_r| \ge 1 - 2s \text{ for some } r \le L) \le 2\exp(-(1 - 2s)^2/(Ls^2))
\le 2\exp(-\log c/(2\delta^2 \log c)) = 2\exp(-1/(2\delta^2)). \tag{11.12}$$

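Let $\kappa(\omega, t) := \inf\{k : |h_k(\omega, t)| > 1 - 2s\} \le +\infty$. Then $|f_k'(t)| \le 1$ for
all $k \le \kappa(\omega, t)$. Let $\Phi_k(\omega, t) := f_{\min(\kappa(\omega,t),\, k)}(\omega, t)$. Then $\Phi_k \in M_1$ for each $k$. If
$|h_r(t)| \le 1 - 2s$ for all $r \le L$, then $|f_r'(t)| \le 1$ and $|g_r'(t)| \le 1$ for all $r \le L$,
$\kappa(\omega, t) > L$, and $\Phi_L(\omega, t) = f_L(\omega, t)$.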

Let $\chi_{ki}(\omega) := 1$ if $\kappa(\omega, t) > k$ for some, or equivalently all, $t \in \mathrm{Int}_{ki}$,
otherwise let $\chi_{ki}(\omega) = 0$. For any real $x$ let $x^+ := \max(x, 0)$. For $k =
0, 1, \ldots, L - 1$ and $i = 1, \ldots, 2^k$ we have $Q_{ki} \subset S(f_{k+1})$ if and only if $A_{ki}$
holds. So, $Q_{ki} \subset S(\Phi_L(\omega, \cdot))$ if and only if both $A_{ki}$ holds and $\chi_{ki} = 1$. Let

$$a_{ki} := 1_{A_{ki}},$$
$$S_{Q,L} := \sum_{k=0}^{L-1} \sum_{i=1}^{2^k} \chi_{ki} N_c(Q_{ki})^+, \tag{11.13}$$
$$S_{V,L} := \sum_{k=0}^{L-1} \sum_{i=1}^{2^k} v_{ki} \tag{11.14}$$

where $v_{ki} := \chi_{ki} a_{ki} N_c(V_{ki})$. Also, let $u_{ki} := \chi_{ki} N_c(Q_{ki})^+$ and $W :=
N_c(S(f_0))$ where $S(f_0) = [0, 1] \times [0, 1/2]$. Then $S(f_k)$ is the union, disjoint
up to 1-dimensional boundary line segments, of $S(f_0)$ and of those $Q_{ri}$ and $V_{ri}$
with $r < k$ such that $A_{ri}$ holds. So

$$N_c(S(\Phi_L)) = S_{Q,L} + S_{V,L} + W. \tag{11.15}$$


From the definitions, we can see that the set where x(w, t) < k is B4) mea- 
surable for each ft, and thus so is each x;;. Also, Qg; is a B® measurable 
random set, disjoint from G(k — 1). Since P(Q;;) is fixed, by Lemma 11.8, 
N.(Q;i)*, a function of N,(Q,;), is independent of B¢—) and thus of x,;. 

Similarly, ag; is independent of xg; by Lemma 11.8. Both are measurable 
with respect to Bi = Bee x while V;; is disjoint from G(k, i), and P(V;;) 
is a constant. So by Lemma 11.8 again, N.(V;i) is independent of Be.i, 
and the three variables N.(Vii), ag; and x; are jointly independent. Since 
L < (logc)/9, in (11.10 the parameters of the Poisson variables for k < L 
are > cs/(6- 4") > 1 for c large enough. By (11.12) and the Tonelli—Fubini 
theorem we have for ô < 1/4 and k = 0,1,..., L — 1 that 


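\[
1/2 \le 1 - 2\exp\bigl(-1/(2\delta^2)\bigr) \le \Pr(\kappa > k) \le \frac{1}{2^k}\sum_{i=1}^{2^k} E\chi_{ki}. \tag{11.16}
\]

Let $X_{ki} := N_c(Q_{ki})^+$ and $p_{ki} := \Pr(\chi_{ki} = 0)$. Then by (11.16),

\[
\sum_{i=1}^{2^k} p_{ki} \le 2^{k+1}\exp\bigl(-1/(2\delta^2)\bigr).
\]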


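For each $k$ and $i$, $E[(1-\chi_{ki})X_{ki}] \le p_{ki}^{1/2}(EX_{ki}^2)^{1/2}$. Thus by (11.10), setting $S_1 := \sum_{k=0}^{L-1}\sum_{i=1}^{2^k}(1-\chi_{ki})X_{ki}$, we have

\[
\begin{aligned}
ES_1 &\le \sum_{k=0}^{L-1}\sum_{i=1}^{2^k} p_{ki}^{1/2}\bigl(cs/(6\cdot 4^k)\bigr)^{1/2} \\
&\le \sum_{k=0}^{L-1}\bigl[2^{k+1}\exp(-1/(2\delta^2))\bigr]^{1/2}\bigl[2^k cs/(6\cdot 4^k)\bigr]^{1/2} \\
&= 3^{-1/2}L(cs)^{1/2}\exp(-1/(4\delta^2)) \le (\delta^2\log c)(cs)^{1/2}\exp(-1/(4\delta^2)),
\end{aligned}
\]

so

\[
ES_1 \le c^{1/2}(\log c)^{3/4}\delta^2\exp(-1/(4\delta^2)). \tag{11.17}
\]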
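Letting $S_2 := \sum_{k=0}^{L-1}\sum_{i=1}^{2^k} X_{ki}$, we have by Lemma 11.14, and since by symmetry $EN_c(Q_{ki})^+ = E|N_c(Q_{ki})|/2$, that

\[
ES_2 \ge \sum_{k=0}^{L-1} 2^k\cdot\tfrac12 c_0\bigl(cs/(6\cdot 4^k)\bigr)^{1/2} = 24^{-1/2}c_0L(cs)^{1/2} \ge c_1\delta^2c^{1/2}(\log c)^{3/4},
\]

where $c_1 := c_0/10$. By (11.17) and Markov's inequality we have for any $\alpha > 0$,

\[
\Pr\bigl(S_1 \ge \alpha\delta^2c^{1/2}(\log c)^{3/4}\bigr) \le \alpha^{-1}\exp(-1/(4\delta^2)). \tag{11.18}
\]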


If $j < k$ or $j = k$ and $i < r$, then $X_{ji}$ is measurable for $\mathcal{B}_{G(k,r-1)}$ while by Lemma 11.8, $X_{kr}$ is independent of it. Thus all the variables $X_{ji}$ are independent, and by (11.10),

\[
\operatorname{Var}(S_2) = \sum_{j=0}^{L-1}\sum_{i=1}^{2^j}\operatorname{Var}(X_{ji}) \le \sum_{j=0}^{L-1} 2^jcs/(6\cdot 4^j) < cs = c/(\log c)^{1/2}.
\]

Thus $S_2$ has standard deviation $< c^{1/2}/(\log c)^{1/4}$, and

\[
S_2 - ES_2 = o_p(ES_2) \quad\text{as } c \to \infty \tag{11.19}
\]

by Chebyshev's inequality.

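To find the covariance $\operatorname{Cov}(\mathcal{V}_{ji}, \mathcal{V}_{kr})$ of two different terms of $S_{V,L}$ in (11.14), we can again assume $j < k$ or $j = k$ and $i < r$. Let $Y_{uv} := N_c(V_{uv})$ for each $u$ and $v$. Since $E\mathcal{V}_{uv} = 0$, we need to find

\[
E_{jikr} := E(\chi_{ji}a_{ji}Y_{ji}\chi_{kr}a_{kr}Y_{kr}) = E(\chi_{ji}N_c(W_{ji})\chi_{kr}a_{kr}Y_{kr})
\]

where $W_{ji} := V_{ji}$ on $A_{ji}$ and $W_{ji} := \emptyset$ otherwise. Then $W_{ji}$ and $V_{kr}$ are $\mathcal{B}_{k,r}$-measurable random sets disjoint from the stopping set $G(k,r)$, and each of the three random sets has finitely many possible values. Thus Lemma 11.8 applies to $G = G(k,r)$, $A = W_{ji}$ and $B = V_{kr}$. Also, $A \cap B = \emptyset$. Thus we have

\[
\begin{aligned}
E_{jikr} &= E\,E(\chi_{ji}N_c(W_{ji})\chi_{kr}a_{kr}Y_{kr} \mid \mathcal{B}_{k,r}) \\
&= E\bigl[\chi_{ji}\chi_{kr}a_{kr}E(N_c(W_{ji})N_c(V_{kr}) \mid \mathcal{B}_{k,r})\bigr] \\
&= E\bigl[\chi_{ji}\chi_{kr}a_{kr}E(N_c(W_{ji}) \mid \mathcal{B}_{k,r})E(N_c(V_{kr}) \mid \mathcal{B}_{k,r})\bigr] = 0
\end{aligned}
\]

since the last conditional expectation is 0, and $E(\mathcal{V}_{ji}\mathcal{V}_{kr}) = 0 = E(\mathcal{V}_{ji}) = E(\mathcal{V}_{kr})$. So the terms are orthogonal, and

\[
\operatorname{Var}(S_{V,L}) = \sum_{j=0}^{L-1}\sum_{i=1}^{2^j}\operatorname{Var}(\mathcal{V}_{ji}) \le \sum_{j=0}^{L-1}\sum_{i=1}^{2^j}EN_c(V_{ji})^2 \le \sum_{j=0}^{L-1}2^jcs/(12\cdot 4^j) < cs.
\]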


Also, for $W$ in (11.15), $\operatorname{Var}(W) \le c$. Thus by Chebyshev's inequality, $S_{V,L} = o_p(ES_2)$ and $W = o_p(ES_2)$ as $c \to \infty$. Note that we have $S_{Q,L} = S_2 - S_1$. Then by (11.15) and (11.19),

\[
N_c(S(\Phi_L)) = ES_2 + (S_2 - ES_2) - S_1 + S_{V,L} + W = ES_2 - S_1 + o_p(ES_2).
\]

Taking $\alpha := c_1/3$ in (11.18), then $\delta > 0$ small enough so that $(3/c_1)\exp(-1/(4\delta^2)) < \varepsilon/2$, we have for $c$ large enough that

\[
\Pr\bigl\{N_c(S(\Phi_L)) \ge \gamma c^{1/2}(\log c)^{3/4}\bigr\} \ge 1 - \varepsilon
\]

where $\gamma := c_1\delta^2/3$. Since $\Phi_L \in \mathcal{M}_1$, the conclusion for $N_c$ follows.
Now, suppose the theorem fails for the centered Poisson process $Y_c$, for some $\varepsilon > 0$. Thus, for arbitrarily small $\delta > 0$, $\Pr\bigl[\|Y_c\|_{\mathcal{C}(1,2,2)} \le \delta c^{1/2}(\log c)^{3/4}/2\bigr] \ge \varepsilon$. Then, subtracting two independent versions of $Y_c$ to get an $N_c$, we have $\Pr\bigl[\|N_c\|_{\mathcal{C}(1,2,2)} \le \delta c^{1/2}(\log c)^{3/4}\bigr] \ge \varepsilon^2$, a contradiction.

The $\nu_n$ case follows from Lemma 11.7.


11.5 Proof of Theorem 11.10 


Proof. First, there are measurability properties to consider. For $\mathcal{C}(\alpha,K,d)$, let $\alpha = \beta + \gamma$ where $0 < \gamma \le 1$ and $\beta$ is an integer $\ge 0$. The set $\mathcal{G}(\alpha,K,d)$ of functions is compact in the $\|\cdot\|_\beta$ norm: it is totally bounded by the Arzelà-Ascoli Theorem (RAP, Theorem 2.4.7) and closed since uniform (or pointwise) convergence of functions $g$ preserves a Hölder condition $|G(x) - G(y)| \le K|x-y|^\gamma$. Recall that for $x \in \mathbb{R}^d$ we let $x_{(d)} := (x_1,\dots,x_{d-1})$. The set $\{(x,g) : 0 \le x_d \le g(x_{(d)}),\ g \in \mathcal{G}(\alpha,K,d)\}$ is compact, for the $\|\cdot\|_\beta$ norm on $g$. So the class $\mathcal{C}(\alpha,K,d)$ is image admissible Suslin.

Measurability for the lower layer case $\mathcal{C} = \mathcal{LL}_d$ follows as in the case $d = 2$ treated soon after the statement of Theorem 11.10. For the case $\mathcal{C} = \mathcal{C}_3$ of convex sets in $\mathbb{R}^3$, for any $F \subset \{X_1,\dots,X_n\}$ there is a smallest convex set including $F$, its convex hull (a polyhedron). The maximum of $P_n - P$ will be found by maximizing over $2^n - 1$ such sets. Conversely, to maximize $(P - P_n)(C)$, or equivalently $P(C)$, over all closed convex sets $C$ such that $C \cap \{X_1,\dots,X_n\} = F$, recall that a closed convex set is an intersection of closed half-spaces (RAP, Theorem 6.2.9). If $|F| = k$, it suffices in the present case to take an intersection of no more than $n - k$ closed half-spaces, one to exclude each $X_j \notin F$, $j \le n$. Thus $\|P_n - P\|_{\mathcal{C}} = \|P_n - P\|_{\mathcal{D}}$ for a finite-dimensional class $\mathcal{D}$ of sets for which an image admissible Suslin condition holds. It follows that the norms $\|Y_\lambda\|_{\mathcal{C}}$ and $\|\nu_n\|_{\mathcal{C}}$ are measurable for the classes $\mathcal{C}$ in the statement.

Theorem 11.10 for $Y_\lambda$ implies it for $\nu_n$ by Lemma 11.7.

By assumption $d \ge 2$. Let $J := [0,1)$, so that $J^{d-1} = \{x \in \mathbb{R}^{d-1} : 0 \le x_j < 1 \text{ for } j = 1,\dots,d-1\}$.

Let $f_1$ be a $C^\infty$ function on $\mathbb{R}$ such that $f_1(t) = 0$ for $t \le 0$ or $t \ge 1$, for some $\kappa$ with $0 < \kappa < 1$, $f_1(t) = \kappa$ for $1/3 \le t \le 2/3$, and $0 < f_1(t) \le \kappa$ for $t \in (0,1/3) \cup (2/3,1)$. Specifically, we can let $f_1(t) := (g * h)(t) := \int_{-\infty}^{\infty} g(t-y)h(y)\,dy$ where $h := 1_{[1/6,5/6]}$ and $g$ is a $C^\infty$ function with $g(t) > 0$ for $|t| < 1/6$ and $g(t) = 0$ for $|t| \ge 1/6$. For $x \in \mathbb{R}^{d-1}$ let $f(x) := f_1(x_1)f_1(x_2)\cdots f_1(x_{d-1})$. Then $f$ is a $C^\infty$ function on $\mathbb{R}^{d-1}$ with $f(x) = 0$ outside the unit cube $J^{d-1}$, $0 < f(x) \le \gamma$ for $x$ in the interior of $J^{d-1}$, and $f \equiv \gamma$ on the subcube $[1/3,2/3]^{d-1}$ where $0 < \gamma := \kappa^{d-1} < 1$. Taking a small enough positive multiple of $f_1$ we can assume that $\sup_x\sup_{[p]\le d-1}|D^pf(x)| \le 1$, possibly with a smaller $\gamma > 0$.
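The construction of $f_1$ as a convolution can be checked numerically. The following is a minimal sketch, assuming one particular bump $g$ (the text only requires $g$ to be $C^\infty$, positive on $|t| < 1/6$ and zero elsewhere) and a midpoint-rule quadrature; the names `g`, `f1` and `kappa` are illustrative, and no rescaling to make the derivatives at most 1 is attempted.

```python
import math

HALF_WIDTH = 1.0 / 6.0  # g vanishes for |t| >= 1/6, as in the text


def g(t):
    """One possible C-infinity bump: positive for |t| < 1/6, zero otherwise."""
    u = t / HALF_WIDTH
    d = 1.0 - u * u
    if d <= 0.0:
        return 0.0
    return math.exp(-1.0 / d)


def f1(t, n=2000):
    """Approximate (g * h)(t), h the indicator of [1/6, 5/6], by the midpoint rule."""
    a, b = 1.0 / 6.0, 5.0 / 6.0
    step = (b - a) / n
    return step * sum(g(t - (a + (k + 0.5) * step)) for k in range(n))


kappa = f1(0.5)  # plateau value; it equals the integral of g
for t in (-0.2, 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0):
    print(f"f1({t:+.1f}) = {f1(t):.6f}")
```

The printed values should vanish outside $(0,1)$, agree with `kappa` on $[1/3, 2/3]$, and lie strictly between 0 and `kappa` on the two transition intervals.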

Some indexed families of sets and functions, some of them random, will be defined as follows. For each $j = 1,2,\dots$, the unit cube $J^{d-1}$ is decomposed as a union of $3^{j(d-1)}$ disjoint subcubes $C_{ji}$ of side $3^{-j}$ for $i = 1,2,\dots,3^{j(d-1)}$, where each $C_{ji}$ is also a Cartesian product of left closed, right open intervals. Let $B_{ji}$ be a cube of side $3^{-j-1}$ concentric with and parallel to $C_{ji}$. Let $x_{ji}$ be the point of $C_{ji}$ closest to 0 (a vertex) and $y_{ji}$ the point of $B_{ji}$ closest to 0. For $\delta > 0$ and $j = 1,2,\dots$, let $c_j := Cj^{-1}(\log(j+1))^{-1-2\delta}$, where the constant $C > 0$ is chosen so that $\sum_j c_j \le 1$. For $x \in \mathbb{R}^{d-1}$ let

\[
\begin{aligned}
f_{ji}(x) &:= c_j3^{-j(d-1)}f(3^j(x - x_{ji})), \\
g_{ji}(x) &:= c_j3^{-(j+1)(d-1)}f(3^{j+1}(x - y_{ji})).
\end{aligned}
\]

Then $f_{ji}(x) > 0$ on the interior of $C_{ji}$ while $f_{ji}(x) = 0$ outside $C_{ji}$, and likewise for $g_{ji}$ and $B_{ji}$. Note that

\[
f_{ji}(x)/g_{ji}(x) \ge 3^{d-1} > 1 \quad\text{for all } x \in B_{ji}. \tag{11.20}
\]

Let $S_0 := 1/2$. Sequences of random variables $s_{ji} = \pm 1$ and random functions on $I^{d-1}$

\[
S_k := S_0 + \sum_{j=1}^{k}\sum_{i=1}^{3^{j(d-1)}} s_{ji}f_{ji} \tag{11.21}
\]

will be defined recursively. Then since $d \ge 2$, we have $\sum_{j\ge 1} c_j3^{-j(d-1)} \le 1/3$, so $0 < S_k < 1$ for all $k$. Given $j \ge 1$ and $S_{j-1}$, let

\[
D_{ji} := D_{ji}(\omega) := \{x \in J^d : |x_d - S_{j-1}(x_{(d)})(\omega)| < g_{ji}(x_{(d)})\}. \tag{11.22}
\]

Let $s_{ji}(\omega) := +1$ if $Y_\lambda(D_{ji}(\omega)) > 0$, otherwise $s_{ji}(\omega) := -1$. This finishes the recursive definition of the $s_{ji}$ and the $S_k$.

By induction, each $D_{ji}$ has only finitely many possible values, each on a measurable event, so the $s_{ji}$ and $S_k$ are all measurable random variables. Since the cubes $C_{ji}$ do not overlap, and all derivatives $D^pf_{ji}$ are 0 on the boundary of $C_{ji}$, it follows for any $k \ge 1$ that

\[
\sup_{[p]\le d-1}\sup_x|D^pS_k(x)| \le \sum_{j=1}^{k}\sup_{[p]\le d-1}\sup_x|D^pf_{j1}(x)| \le \sum_{j=1}^{k}c_j \le 1. \tag{11.23}
\]

The volume of $D_{ji}(\omega)$ is always

\[
P(D_{ji}(\omega)) = 2\int_{J^{d-1}} g_{ji}\,dx = 2\mu c_j/9^{(j+1)(d-1)} \tag{11.24}
\]

where $0 < \mu := \int_{J^{d-1}} f(x)\,dx < 1$.
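Next, it will be shown that $D_{ji}(\omega)$ for different $i, j$ are disjoint. For $1 \le j \le k$ and any $\omega$, $i$ and $x \in B_{ji}$,

\[
\begin{aligned}
|S_k(x) - S_{j-1}(x)|(\omega) &\ge \gamma c_j3^{-j(d-1)} - \gamma\sum_{r>j}c_r3^{-r(d-1)} \\
&\ge \gamma c_j3^{-j(d-1)}\Bigl(1 - \sum_{r\ge 1}3^{-r(d-1)}\Bigr) \\
&\ge \gamma c_j3^{-j(d-1)}/2 > \sup(g_{ji}) + \sup_r\sup(g_{k+1,r}).
\end{aligned}
\]

Thus if $s_{ji}(\omega) = +1$, then for any $r$ and any $x \in B_{ji}$,

\[
S_k(x)(\omega) - g_{k+1,r}(x) > S_{j-1}(x)(\omega) + g_{ji}(x).
\]

It follows that $D_{ji}(\omega)$ is disjoint from $D_{k+1,r}(\omega)$. They are likewise disjoint if $s_{ji}(\omega) = -1$ by a symmetrical argument. For the same $j$ and different $i$, the sets $D_{ji}(\omega)$ are disjoint since the projection $x \mapsto x_{(d)}$ from $\mathbb{R}^d$ onto $\mathbb{R}^{d-1}$ takes them into disjoint sets $B_{ji}$.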

Given $\lambda > 0$ let $r = r(\lambda)$ be the largest $j$, if one exists, such that $2\lambda\mu c_j \ge 9^{(j+1)(d-1)}$. Then

\[
r(\lambda) \sim (\log\lambda)/((d-1)\log 9) \quad\text{as } \lambda \to +\infty. \tag{11.25}
\]
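The definition of $r(\lambda)$ and the asymptotic relation (11.25) can be explored numerically. This is only a sketch: the values of `mu`, `C` and `delta` below are placeholders chosen for illustration (the true $\mu$ depends on the function $f$, and $C$ on the normalization of the $c_j$), and the function name `r_of_lambda` is hypothetical. Because of the logarithmic factors in $c_j$, the ratio to the right side of (11.25) approaches 1 only slowly.

```python
import math


def c_j(j, C=0.25, delta=0.5):
    """c_j = C * j^{-1} * (log(j+1))^{-1-2*delta}; C and delta are placeholder values."""
    return C / (j * math.log(j + 1) ** (1.0 + 2.0 * delta))


def r_of_lambda(lam, d=2, mu=0.05, C=0.25, delta=0.5):
    """Largest j with 2*lam*mu*c_j >= 9^{(j+1)(d-1)}, or None if there is none.

    The comparison is made on logarithms to avoid overflow. Since c_j decreases
    and the right side increases in j, the admissible j form an initial segment.
    """
    r = None
    j = 1
    log9 = math.log(9.0)
    while math.log(2.0 * mu * c_j(j, C, delta)) + math.log(lam) >= (j + 1) * (d - 1) * log9:
        r = j
        j += 1
    return r


for lam in (1e10, 1e30, 1e60):
    r = r_of_lambda(lam)
    print(lam, r, math.log(lam) / math.log(9.0))  # d = 2, so (d - 1) log 9 = log 9
```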
Let $G_0(\omega) := \emptyset$. For $m = 1,2,\dots$, let

\[
G_m(\omega) := G(m)(\omega) := \bigcup\{D_{ji}(\omega) : j \le m\}.
\]

Let $H_m(\omega)$ be the subgraph of $S_m$, $H_m(\omega) := \{x : 0 \le x_d \le S_m(x_{(d)})(\omega)\}$. Then for all $\omega$, by (11.23), $H_m \in \mathcal{C}(d-1,1,d)$. Let $A_m(\omega) := H_m(\omega) \setminus G_m(\omega)$. From the disjointness proof,

\[
H_m(\omega) \cap G_m(\omega) = \bigcup\{D_{ji}(\omega) : j \le m,\ s_{ji} = +1\}. \tag{11.26}
\]

For each $m$, each of the above sets has finitely many possible values, each on a measurable event.

Each $G_j$ is easily seen to be a stopping set, while $A_j(\cdot)$ is $\mathcal{B}_{G(j)}$ measurable and for each $\omega$, $A_j(\omega)$ is disjoint from $G_j(\omega)$. So the hypotheses of Lemma 11.8 hold. Thus, conditional on $\mathcal{B}_{G(j)}$, $X_\lambda(A_j)$ is Poisson with parameter $\lambda P(A_j(\omega))$. Also, for $u^+ := \max(u,0)$, by (11.26),

\[
Y_\lambda(H_m \cap G_m)(\omega) = \sum_{j=1}^{m}\sum_{i=1}^{3^{j(d-1)}} Y_\lambda(D_{ji}(\omega))^+(\omega). \tag{11.27}
\]

Now, $P(D_{ji})$ does not depend on $\omega$ nor $i$, and $D_{ji}(\cdot)$ is $\mathcal{B}_{G(j-1)}$ measurable. Thus by Lemma 11.8, applied to $A := D_{mi}$, replacing $G_m$ by $G_{m-1} \cup \bigcup_{r<i} D_{mr}$, the $Y_\lambda(D_{ji})$ for different $i$ or $j$ are jointly independent, and each has the law of $Y_\lambda(D)$ for a fixed set $D$ with $P(D) := 2\int g_{ji}\,dx$.

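Taking $m := r(\lambda)$, equation (11.27) gives a sum of independent nonnegative parts of centered Poisson variables with parameters $\lambda P(D_{ji}) \ge 1$, by (11.24). So by Lemma 1.18, for a constant $c > 0$,

\[
\begin{aligned}
\lambda^{-1/2}EY_\lambda(H_m \cap G_m) &\ge c\sum_{j=1}^{r(\lambda)}\sum_{i=1}^{3^{j(d-1)}}P(D_{ji})^{1/2} \\
&= c\sum_{j=1}^{r(\lambda)}3^{j(d-1)}\bigl(2\mu c_j/9^{(j+1)(d-1)}\bigr)^{1/2} \quad\text{by (11.24)} \\
&= c\,3^{1-d}(2\mu)^{1/2}\sum_{j=1}^{r(\lambda)}\bigl(Cj^{-1}(\log(j+1))^{-1-2\delta}\bigr)^{1/2} \quad\text{(by def. of } c_j\text{)} \\
&= c(2\mu C)^{1/2}3^{1-d}\sum_{j=1}^{r(\lambda)}j^{-1/2}(\log(j+1))^{-0.5-\delta} \\
&\ge a_d(\log(r(\lambda)+1))^{-0.5-\delta}\sum_{j=1}^{r(\lambda)}j^{-1/2} \\
&\ge 2a_d\bigl(r(\lambda)^{1/2} - 1\bigr)(\log(r(\lambda)+1))^{-0.5-\delta}
\end{aligned}
\]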


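for some constant $a_d > 0$. For $\lambda$ large, by (11.25), the latter expression is $\ge 3b_d(\log\lambda)^{1/2}(\log\log\lambda)^{-0.5-\delta}$ for some $b_d > 0$.

By independence of the variables $Y_\lambda(D_{ji})$ and (11.24), $Y_\lambda((H_{r(\lambda)} \cap G_{r(\lambda)})(\omega))$ has variance less than

\[
\sum_{j=1}^{r(\lambda)}\sum_{i=1}^{3^{j(d-1)}}\lambda P(D_{ji}) = \lambda\sum_{j=1}^{r(\lambda)}2\mu c_j3^{-(j+2)(d-1)} < \lambda.
\]

Thus by Chebyshev's inequality, as $\lambda \to +\infty$,

\[
\Pr\bigl\{\lambda^{-1/2}Y_\lambda((H_{r(\lambda)} \cap G_{r(\lambda)})(\omega)) \ge 2b_d(\log\lambda)^{1/2}(\log\log\lambda)^{-0.5-\delta}\bigr\} \to 1.
\]

As shown around (11.27), $Y_\lambda(A_{r(\lambda)}(\omega))(\omega)$ has, given $\mathcal{B}_{G(r(\lambda))}$, the conditional distribution of $Y_\lambda(D)$ for $P(D) = P(A_{r(\lambda)}(\omega))$, where since $Y_\lambda$ is a centered Poisson process, $EY_\lambda(D)^2 \le \lambda$ for all $D$. Thus $EY_\lambda(A_{r(\lambda)})^2 \le \lambda$, $Y_\lambda(A_{r(\lambda)})/\lambda^{1/2}$ is bounded in probability, and

\[
\Pr\bigl\{Y_\lambda(H_{r(\lambda)}) \ge b_d(\lambda\log\lambda)^{1/2}(\log\log\lambda)^{-0.5-\delta}\bigr\} \to 1
\]

as $\lambda \to +\infty$. This proves Theorem 11.10 for $Y_\lambda$, since given the result for $K = 1$, one can let $\gamma(d,K,\delta) := K\gamma(d,1,\delta)$. For $\nu_n$, Lemma 11.7 then applies. This completes the proof for classes $\mathcal{C}(d-1,K,d)$. For convex sets in $\mathbb{R}^3$ see Dudley (1982, Theorem 4, proof in Section 5). Lower layers were treated in Section 11.4.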


Problems 


1. Find a lower bound for the constant $c(d,\alpha)$ in the proof of Theorem 11.1.
Hint: c(d, a) > 3-7“ P(f)/II f le- 

2. In the proof of Theorem 11.2, show that r can be taken as n + 1 for the 
largest n = 0, 1,... such that m(n) = 0. 

3. If $(Y, \|\cdot\|)$ is any Banach space, in particular $(\ell^\infty(\mathcal{F}), \|\cdot\|_{\mathcal{F}})$ for any $\mathcal{F}$, for any $\gamma > 0$ there is a bounded continuous (in fact, Lipschitz) function $H \ge 0$ on $Y$ such that $H(y) > 0$ if and only if $\|y\| < \gamma$. Use this to prove the statement made in the Remarks after Theorem 11.1 that $\mathcal{G}(d/2,1,d)$ is not $P$-Donsker for $P = U(I^d)$. Hint: If $Y$ is 1-dimensional, say $Y = \mathbb{R}$, find such a function of the form $h(|y|)$. In general take $h(\|y\|)$.

4. What should replace n? (log n)” in Theorem 11.4 if 0 < ¢ < 1? Hint: Any 
sequence $a_n \to +\infty$.

5. To show that Poisson processes are well-defined, show that if $X$ and $Y$ are independent real random variables with $\mathcal{L}(X) = P_a$ and $\mathcal{L}(Y) = P_b$ then $\mathcal{L}(X+Y) = P_{a+b}$.
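The additivity of Poisson laws asked for in Problem 5 can be sanity-checked numerically before proving it. The sketch below is not a proof; it only compares the convolution of two Poisson probability mass functions with the Poisson mass function for the summed parameter, for one arbitrary pair of parameters. The function names are illustrative.

```python
import math


def poisson_pmf(lam, k):
    """Probability that a Poisson variable with parameter lam equals k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def convolved_pmf(a, b, k):
    """Pr(X + Y = k) for independent X ~ Poisson(a), Y ~ Poisson(b), by direct convolution."""
    return sum(poisson_pmf(a, i) * poisson_pmf(b, k - i) for i in range(k + 1))


a, b = 1.5, 2.5  # arbitrary illustrative parameters
worst = max(abs(convolved_pmf(a, b, k) - poisson_pmf(a + b, k)) for k in range(25))
print("largest discrepancy:", worst)
```

The discrepancy should be at the level of floating-point rounding; the proof itself reduces to the binomial theorem applied to $(a+b)^k$.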

6. Let $P$ be a law on $\mathbb{R}$ and $U_c$ the Poisson process with intensity measure $cP$ for some $c > 0$. For a given $y > 0$ let $X$ be the least $x \ge 0$ such that $U_c([-x,x]) \ge y$. Show that the interval $[-X,X]$ is a stopping set.


Notes


Notes to Section 11.1. Bakhvalov (1959) proved Theorem 11.1. W. Schmidt 
(1975) proved Theorem 11.3. Theorem 11.2, proved by the same method as 
theirs, also is in Dudley (1982, Theorem 1). 


Notes to Section 11.2. Walter Philipp (unpublished) proved Theorem 11.4 for $\zeta = 1$, with a refinement ($\eta = 1$ and a power of $\log\log n$ factor). The proof for $\zeta > 1$ follows the same scheme. Theorem 11.4 is Theorem 2 of Dudley (1982).


Notes to Section 11.3. Lemma 11.5 is classical, see e.g. Feller (1971, VIII.8, 
Lemma 2). Lemma 11.6 and with it the main idea of Poissonization are due 
to Kac (1949). Pyke (1968), along the line of Lemma 11.7, gives relations 
between Poissonized and non-Poissonized cases. Evstigneev (1977, Theorem 
1) proves a Markov property for random fields indexed by closed subsets of a 
Euclidean space, somewhat along the line of Lemma 11.8. This entire section 
is a revision of Section 3 of Dudley (1982), with proofs of Lemmas 11.5 and 
11.6 supplied here. 


Notes to Section 11.4. Theorem 11.10 was proved in Dudley (1982) for all
$d \ge 2$. Another proof was given for lower layers and $d = 2$ in Dudley (1984).
Shor (1986) discovered the larger lower bound with $(\log n)^{3/4}$ in expectation,
and contributed to a later version of the proof (Coffman and Lueker, 1991, pp. 
57-64). Shor gave the definition of f; and g;. Theorem 11.11 here shows that 
the lower bound holds not only in expectation, but with probability going to 1, 
by the methods of Dudley (1982, 1984). I thank J. Yukich for providing very 
helpful advice about this section. 


Notes to Section 11.5. The proof is from Dudley (1982).


Appendix A


Differentiating under an Integral Sign 


The object here is to give some sufficient conditions for the equation

\[
\frac{d}{dt}\int f(x,t)\,d\mu(x) = \int\frac{\partial f(x,t)}{\partial t}\,d\mu(x). \tag{A.1}
\]

Here $x \in X$, where $(X,\mathcal{S},\mu)$ is a measure space, and $f$ is real-valued. The derivatives with respect to $t$ will be taken at some point $t = t_0$. The function $f$ will be defined for $x \in X$ and $t$ in an interval $J$ containing $t_0$. Assume for the time being that $t_0$ is in the interior of $J$. Let

\[
f_2(x,t_0) := \partial f(x,t)/\partial t\,|_{t=t_0}.
\]


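Some assumptions are clearly needed for (A.1) even to make sense. A set $\mathcal{F} \subset \mathcal{L}^1(X,\mathcal{S},\mu)$ is called $\mathcal{L}^1$-bounded iff $\sup\{\int|f|\,d\mu : f \in \mathcal{F}\} < \infty$. Here are some basic assumptions:

\[
\begin{aligned}
&\text{For some } \delta > 0,\ \{f(\cdot,t) : |t - t_0| < \delta\} \text{ are } \mathcal{L}^1\text{-bounded;} \\
&f_2(x,t_0) \text{ exists for } \mu\text{-almost all } x, \\
&\text{and } f_2(\cdot,t_0) \in \mathcal{L}^1(X,\mathcal{S},\mu).
\end{aligned} \tag{A.2}
\]

Here is an example to show that the conditions (A.2) are not sufficient: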


Example. Let $(X,\mu)$ be $[0,1]$ with Lebesgue measure. Let $J = [-1,1]$, $t_0 = 0$. Let $f(x,t) := 1/t$ if $t > 0$ and $0 \le x \le t$, otherwise let $f(x,t) := 0$. Then for all $x \ne 0$, $f(x,t) = 0$ for $t$ in a neighborhood of 0, so $f_2(x,0) = 0$ for $x \ne 0$. For each $t > 0$, $\int_0^1 f(x,t)\,dx = 1$, while $\int_0^1 f(x,0)\,dx = 0$. So the function $t \mapsto \int_0^1 f(x,t)\,dx$ is not even continuous, and so not differentiable, at $t = 0$, so (A.1) fails.
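The failure of (A.1) in this example is easy to see numerically. The sketch below uses a plain midpoint rule (an assumption of this illustration, not part of the text) to approximate $\int_0^1 f(x,t)\,dx$ for several $t$, and also shows that at a fixed $x > 0$ the difference quotient in $t$ is eventually 0, so that the pointwise derivative gives no hint of the jump in the integral.

```python
def f(x, t):
    """The integrand of the example: 1/t on 0 <= x <= t when t > 0, and 0 otherwise."""
    return 1.0 / t if t > 0 and 0.0 <= x <= t else 0.0


def integral_over_unit_interval(t, n=20000):
    """Midpoint-rule approximation of the integral of f(., t) over [0, 1]."""
    step = 1.0 / n
    return step * sum(f((k + 0.5) * step, t) for k in range(n))


for t in (0.5, 0.1, 0.01, 0.0):
    print(t, integral_over_unit_interval(t))

# At a fixed x > 0 the difference quotient (f(x, h) - f(x, 0)) / h vanishes once 0 < h < x.
x = 0.3
print([(f(x, h) - f(x, 0.0)) / h for h in (0.2, 0.1, 0.01)])
```

The integral stays near 1 for every $t > 0$ and is 0 at $t = 0$, while each pointwise difference quotient is 0 for small $h$.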

The function $f$ in the last example behaves badly near $(0,0)$. Another possibility has to do with behavior where $x$ is unbounded, while $f$ may be very regular at finite points. Recall that a function $f$ of two real variables $x, t$ is called $C^\infty$ if the partial derivatives $\partial^{p+q}f/\partial x^p\partial t^q$ exist and are continuous for all nonnegative integers $p, q$.


Example. For $J = (-1,1)$, $X = \mathbb{R}$ with Lebesgue measure $\mu$, and $t_0 = 0$, there exists a $C^\infty$ function $f$ on $X \times J$ such that for $0 < t < 1$, $\int_{-\infty}^{\infty} f(x,t)\,dx = 1$, while for $t_0 = 0$, and any $x$, $f(x,t) = 0$ for $t$ in a neighborhood of 0 (depending on $x$), so $f(x,0) = f_2(x,0) = 0$. Thus $\int_{-\infty}^{\infty} f(x,0)\,dx = 0$ while $\int_{-\infty}^{\infty} f(x,t)\,dx$ is not continuous (and so not differentiable) at 0 and (A.1) fails. To define such an $f$, let $g$ be a $C^\infty$ function on $\mathbb{R}$ with compact support, specifically $g(x) := c\cdot\exp(-1/(x(1-x)))$ for $0 < x < 1$ and $g(x) := 0$ elsewhere, where $c$ is chosen to make $\int_{-\infty}^{\infty} g(x)\,dx = \int_0^1 g(x)\,dx = 1$. It can be checked (by L'Hospital's rule) that $g$ and all its derivatives approach 0 as $x \downarrow 0$ or $x \uparrow 1$, so $g$ is $C^\infty$. Now let $f(x,t) := g(x - t^{-1})$ for $t > 0$ and $f(x,t) := 0$ otherwise. The stated properties then follow since $f(x,t) = 0$ if $t \le 0$ or $x \le 0$ or $0 < t \le 1/x$.


For interchanging two integrals, there is a standard theorem, the Tonelli— 
Fubini theorem (e.g., RAP, Theorem 4.4.5). But for interchanging a derivative 
and an integral there is apparently not such a handy single theorem yielding 
(A.1). Some sufficient conditions will be given, beginning with rather general 
ones and going on to more special but useful conditions. 

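A set $\mathcal{F}$ of integrable functions on $(X,\mathcal{S},\mu)$ is called uniformly integrable or u.i. if for every $\varepsilon > 0$ there exist both a set $A$ with $\mu(A) < \infty$ such that

\[
\int_{X\setminus A}|f|\,d\mu < \varepsilon \quad\text{for all } f \in \mathcal{F}, \tag{A.3}
\]

and an $M < \infty$ such that

\[
\int_{|f|>M}|f|\,d\mu < \varepsilon \quad\text{for all } f \in \mathcal{F}. \tag{A.4}
\]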
|fl>M 

In the presence of (A.3), condition (A.4) is equivalent to: $\mathcal{F}$ is $\mathcal{L}^1$-bounded and for any $\varepsilon > 0$ there is a $\delta > 0$ such that

\[
\text{for any } C \in \mathcal{S} \text{ with } \mu(C) < \delta \text{ and any } f \in \mathcal{F},\quad \int_C|f|\,d\mu < \varepsilon; \tag{A.5}
\]

this was proved in RAP (Theorem 10.3.5) for probability measures, where (A.3) is vacuous. The proof extends easily to the general case.

The best-known sufficient condition for uniform integrability is known as 
domination. The following is straightforward to prove, via (A.4): 


Theorem A.1 If $f \in \mathcal{L}^1(X,\mathcal{S},\mu)$ and $f \ge 0$, then $\{g \in \mathcal{L}^1(X,\mathcal{S},\mu) : |g| \le f\}$ is uniformly integrable.


Here $|g| \le f$ means $|g(x)| \le f(x)$ for all $x \in X$. But domination is not necessary for uniform integrability, as the following shows:

Example. Let $(X,\mathcal{S},\mu)$ be $[0,1]$ with Lebesgue measure. Let $g(t,x) := |t-x|^{-1/2}$, $0 \le t \le 1$, $0 \le x \le 1$. Then the set of all functions $g(t,\cdot)$, $0 \le t \le 1$, is uniformly integrable but not dominated.
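The two claims in this example can be verified by a short computation, sketched here under the conventions of (A.4) and (A.5); the bound uses only the fact that $u \mapsto |u|^{-1/2}$ decreases in $|u|$, so that among sets of a given Lebesgue measure $m$ an interval centered at $t$ maximizes the integral.

```latex
% L^1-boundedness: for every t in [0,1],
\int_0^1 |t-x|^{-1/2}\,dx = 2t^{1/2} + 2(1-t)^{1/2} \le 2\sqrt{2}.

% Condition (A.5): for measurable C \subset [0,1] with \mu(C) = m and any t,
\int_C |t-x|^{-1/2}\,dx \le \int_{t-m/2}^{t+m/2} |t-x|^{-1/2}\,dx
   = 4\,(m/2)^{1/2} = 2\,(2m)^{1/2},
% which is < \varepsilon as soon as m < \delta := \varepsilon^2/8.

% No domination: any f with |g(t,\cdot)| \le f for all t would satisfy
f(x) \ge \sup_{0\le t\le 1} |t-x|^{-1/2} = +\infty \quad\text{for every } x \in [0,1].
```

Since $\mu$ is a probability measure here, (A.3) is vacuous, so $\mathcal{L}^1$-boundedness together with (A.5) gives uniform integrability.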

Uniform integrability provides a general sufficient condition for (A.1). Let

\[
\Delta_hf(x,t_0) := f(x,t_0+h) - f(x,t_0).
\]

Theorem A.2 Assume (A.2) and that for some neighborhood $U$ of 0,

\[
\{\Delta_hf(x,t_0)/h : 0 \ne h \in U\} \text{ are uniformly integrable.} \tag{A.6}
\]

Then (A.1) holds and moreover

\[
\int|\Delta_hf(x,t_0)/h - f_2(x,t_0)|\,d\mu(x) \to 0 \quad\text{as } h \to 0. \tag{A.7}
\]

Conversely, (A.7) and (A.2) imply (A.6).


Proof. (A.2) implies that the difference-quotients $\Delta_hf(x,t_0)/h$ converge pointwise for almost all $x$ as $h \to 0$ to $\partial f(x,t)/\partial t|_{t=t_0}$. Pointwise convergence and uniform integrability (A.6) imply $\mathcal{L}^1$ convergence (A.7): this was proved in (RAP, Theorem 10.3.6) for a sequence of functions on a probability space. The implication holds in the case here since:

(a) By (A.3), we can assume that up to a difference of at most $\varepsilon$ in all the integrals, the functions are defined on a finite measure space, which reduces by a constant multiple to a probability space;

(b) If (A.7) fails, then it fails along some sequence $h = h_n \to 0$, so a proof for sequences is enough. Thus (A.7) follows from the cited fact.

Now suppose (A.7) and (A.2) hold. Then given $\varepsilon > 0$, there is a set $A$ of finite measure with $\int_{X\setminus A}|f_2(x,t_0)|\,d\mu(x) < \varepsilon/2$, and for $h$ small enough, $\int|\Delta_hf(x,t_0)/h - f_2(x,t_0)|\,d\mu(x) < \varepsilon/2$, so (A.3) holds for the functions $\Delta_hf/h$, $h \in U$, as desired.

For sequences of integrable functions on a probability space, convergence in $\mathcal{L}^1$ implies uniform integrability (RAP, Theorem 10.3.6). Here again, we can reduce to the case of probability spaces, and if (A.6) fails, for a given $\varepsilon$, then it fails along some sequence, contradicting the cited fact.


But given (A.2), the uniform integrability condition (A.6), necessary for 
(A.7), is not necessary for (A.1), as is shown by a modification of a previous 
example: 


Example. Let $(X,\mu) = [-1,1]$ with Lebesgue measure, $J = [-1,1]$, and $t_0 = 0$. If $t > 0$, let $f(x,t) = 1/t$ if $0 < x < t$, $f(x,t) = -1/t$ if $-t < x < 0$, and $f(x,t) = 0$ otherwise. Then $\int_{-1}^{1} f(x,t)\,dx = 0$ for all $t$ and $\int_{-1}^{1} f_2(x,0)\,dx = 0$ since $f_2(x,0) \equiv 0$, so (A.1) holds.

Domination of the difference-quotients by an integrable function gives the
classical “Weierstrass” sufficient condition for (A.1), which follows directly 
from Theorems A.1 and A.2: 


Corollary A.3 If (A.2) holds and for some neighborhood $U$ of 0, there is a $g \in \mathcal{L}^1(X,\mathcal{S},\mu)$ such that $|\Delta_hf(x,t_0)/h| \le g(x)$ for $0 \ne h \in U$, then (A.1) holds.


Corollary A.3 is highly useful, but it is not as directly applicable as is the 
Tonelli—Fubini theorem on interchanging integrals. In some cases, even when 
a dominating function $g \in \mathcal{L}^1$ exists, it may not be easy to choose such a $g$ for
which integrability can be proved conveniently. 

We can take the neighborhoods $U$ to be sets $\{h : |h| < \delta\}$, for $\delta > 0$. For a given $\delta > 0$, the smallest possible dominating function would be

\[
g_\delta(x) := \sup\{|\Delta_hf(x,t_0)/h| : 0 < |h| < \delta\}.
\]

If $(X,\mathcal{S})$ can be taken to be a complete separable metric space with its $\sigma$-algebra of Borel sets, and if $f$ is jointly measurable, then each $g_\delta$ is measurable for the completion of $\mu$ (RAP, Section 13.2). If so, then the hypothesis of Corollary A.3, beside (A.2), is equivalent to saying that

\[
\text{for some } \delta > 0,\quad g_\delta \in \mathcal{L}^1(\mu). \tag{A.8}
\]


So far, nothing required the partial derivatives $f_2(x,t)$ to exist for $t \ne t_0$, but if they do, we get other sufficient conditions:


Corollary A.4 Suppose (A.2) holds, $f(\cdot,\cdot)$ is jointly measurable, and there is a neighborhood $U$ of $t_0$ such that for all $t \in U$, $f_2(x,t)$ exists for almost all $x$ and the functions $f_2(\cdot,t)$ for $t \in U$ are uniformly integrable. Suppose also that for $t \in U$, $f(x,t) - f(x,t_0) = \int_{t_0}^{t} f_2(x,s)\,ds$ for $\mu$-almost all $x$. Then (A.6) and so (A.7) and (A.1) hold.


Proof. The functions $f_2(\cdot,t)$ for $t \in U$, being uniformly integrable, are $\mathcal{L}^1$-bounded. We can take $U$ to be a bounded interval. Then $\int_U\int_X|f_2(x,t)|\,d\mu(x)\,dt < \infty$. Now for $|h|$ small enough,

\[
|\Delta_hf(x,t_0)/h| = \Bigl|\frac{1}{h}\int_0^h f_2(x,t_0+u)\,du\Bigr| \le \frac{1}{|h|}\int_{-|h|}^{|h|}|f_2(x,t_0+u)|\,du < \infty.
\]

Thus the difference-quotients $\Delta_hf(\cdot,t_0)/h$ are also $\mathcal{L}^1$-bounded for $h$ in a neighborhood of 0. Condition (A.3) for the functions $f_2(\cdot,t)$ implies it for the functions $\Delta_hf(\cdot,t_0)/h$. Also, condition (A.5) for the functions $f_2(\cdot,t_0+u)$ directly implies (A.5) and (A.6) for the functions $\Delta_hf(x,t_0)/h$ as desired.


Three clear consequences of Corollary A.3 will be given.

Corollary A.5 Suppose $f(x,t) = G(x,t)H(x)$ for measurable functions $G(\cdot,\cdot)$ and $H$ where (A.2) holds for $f$. Suppose that for $h$ in a neighborhood of 0, $|\Delta_hG(x,t_0)/h| \le \alpha(x)$ for a function $\alpha$ such that $\int|\alpha(x)H(x)|\,d\mu(x) < \infty$. Then (A.6), (A.7), and (A.1) hold.


Corollary A.6 Let $G(x,t) = \phi(x-t)$ where $\phi$ is bounded and has a bounded, continuous first derivative, $X = \mathbb{R}$, $\mu$ = Lebesgue measure, and $H \in \mathcal{L}^1(\mathbb{R})$. Then the convolution integral $(\phi * H)(t) := \int_{-\infty}^{\infty}\phi(t-x)H(x)\,dx$ has a first derivative with respect to $t$ given by $(\phi * H)'(t) = \int_{-\infty}^{\infty}\phi'(t-x)H(x)\,dx$. If the $j$th derivative $\phi^{(j)}$ is continuous and bounded for $j = 0,1,\dots,n$, then iterating gives $(\phi * H)^{(n)}(t) = \int_{-\infty}^{\infty}\phi^{(n)}(t-x)H(x)\,dx$.


Corollary A.7 Let $f(x,t) = e^{it\psi(x)}H(x)$ where $\psi$ and $H$ are measurable real-valued functions on $X$ such that $H$ and $\psi H$ are integrable. Then (A.6), (A.7), and (A.1) hold.


Proof. Let $G(x,t) := e^{it\psi(x)}$. For any real $t$, $u$, and $h$, $|e^{i(t+h)u} - e^{itu}| \le |uh|$ (RAP, proof of Theorem 9.4.4). So $|\Delta_hG(x,t)/h| \le |\psi(x)|$. Since

\[
\frac{\partial}{\partial t}\bigl(e^{it\psi(x)}H(x)\bigr) = i\psi(x)e^{it\psi(x)}H(x)
\]

and $|i\psi(x)e^{it\psi(x)}| = |\psi(x)|$, conditions (A.2) hold and Corollary A.5 applies.


Note. In Corollary A.7, if $X = \mathbb{R}$ and $\psi(x) = x$, we have Fourier transform integrals.
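As an illustration of the Fourier case, the sketch below takes $H$ to be the standard normal density (an arbitrary choice made for this example, for which $\int e^{itx}H(x)\,dx = e^{-t^2/2}$ is known in closed form) and compares three quantities: the integral of the $t$-derivative of the integrand, a central difference quotient of the integral, and the exact derivative $-te^{-t^2/2}$. The quadrature rule, truncation to $[-10,10]$ and function names are assumptions of the illustration.

```python
import cmath
import math


def H(x):
    """Standard normal density, used here as an example of an integrable H with xH integrable."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def integrate(func, lo=-10.0, hi=10.0, n=4000):
    """Midpoint rule on [lo, hi]; the tails of H beyond |x| = 10 are negligible."""
    step = (hi - lo) / n
    return step * sum(func(lo + (k + 0.5) * step) for k in range(n))


def phi(t):
    """Integral of e^{itx} H(x) dx."""
    return integrate(lambda x: cmath.exp(1j * t * x) * H(x))


def dphi_under_integral(t):
    """Integral of the t-derivative of the integrand, i x e^{itx} H(x)."""
    return integrate(lambda x: 1j * x * cmath.exp(1j * t * x) * H(x))


def dphi_difference(t, h=1e-5):
    """Central difference quotient of phi at t."""
    return (phi(t + h) - phi(t - h)) / (2.0 * h)


t = 0.7
print(dphi_under_integral(t), dphi_difference(t), -t * math.exp(-0.5 * t * t))
```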


Proposition A.8 Let $\psi$ be a measurable function of $x$ and suppose that $\eta(t) := \int e^{t\psi(x)}\,d\mu(x) < \infty$ for all $t$ in an open interval $U$. Then $\eta$ is an analytic function having a power series expansion in $t$ in a neighborhood of any $t_0 \in U$. Derivatives of $\eta$ of all orders can be found by differentiating under the integral,

\[
\eta^{(n)}(t) = \int\psi(x)^ne^{t\psi(x)}\,d\mu(x), \quad n = 1,2,\dots, \text{ for all } t \in U.
\]


Proof. Take $t_0 \in U$. Replacing $\mu$ by $\nu$ where $d\nu(x) = \exp(t_0\psi(x))\,d\mu(x)$, we can assume $t_0 = 0$. Then for some $\varepsilon > 0$, $\{t : |t| \le \varepsilon\} \subset U$, and for $|t| \le \varepsilon$, $e^{|t\psi(x)|} \le e^{\varepsilon\psi(x)} + e^{-\varepsilon\psi(x)}$, an integrable function. Thus, $\eta(t) = \sum_{n=0}^{\infty}t^n\int\psi(x)^n\,d\mu(x)/n!$ where the series converges absolutely by dominated convergence since the corresponding series for $e^{|t\psi(x)|}$ does and is a series of positive terms dominating those for $e^{t\psi(x)}$. The derivatives of $\eta$ at 0 satisfy $\eta^{(n)}(0) = \int\psi(x)^n\,d\mu(x)$: this holds either by the theorem that for a function $\eta$ represented by a power series $\sum_{n=0}^{\infty}c_nt^n$ in a neighborhood of 0, we must have $c_n = \eta^{(n)}(0)/n!$ (converse of Taylor's theorem), or by the methods of this Appendix as follows. For any real $y$, $|e^y - 1| \le |y|(e^y + 1)$ by the mean value theorem since $e^t \le e^y + 1$ for $t$ between 0 and $y$. Thus for any real $u$ and $h$, $|(e^{uh} - 1)/h| \le |u|(e^{uh} + 1)$. For any $\delta > 0$, there is a constant $C = C_\delta$ such that $|u| \le C(e^{\delta u} + e^{-\delta u})$ for all real $u$. If $|h| \le \delta := \varepsilon/2$, then since $e^{uh} \le e^{\delta u} + e^{-\delta u}$ and $e^{cu} + e^{-cu}$ is increasing in $u > 0$ for any real $c$, we have

\[
|(e^{uh} - 1)/h| \le 3C_\delta(e^{\varepsilon u} + e^{-\varepsilon u}).
\]

Letting $u = \psi(x)$, we have domination by an integrable function, so Corollary A.3 applies for the first derivative. The proof extends to higher derivatives since $|\psi(x)|^n \le Ke^{\delta|\psi(x)|}$ for some $K = K_{n,\delta}$.
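A concrete instance of Proposition A.8 may help. The sketch below assumes a measure chosen purely for illustration: $\mu$ puts mass $1/k!$ at each nonnegative integer $k$, and $\psi(k) = k$, so that $\eta(t) = \sum_k e^{tk}/k! = \exp(e^t)$ is finite for all real $t$. Differentiating under the sum gives $\eta^{(n)}(t) = \sum_k k^ne^{tk}/k!$, which is compared with the closed forms $\eta'(t) = e^t\exp(e^t)$ and $\eta''(t) = (e^t + e^{2t})\exp(e^t)$ and with a difference quotient. The truncation at 80 terms is a numerical convenience.

```python
import math


def eta_derivative(n, t, terms=80):
    """n-th derivative of eta(t) = sum_k e^{t k} / k!, computed term by term (n = 0 gives eta)."""
    return sum(k ** n * math.exp(t * k) / math.factorial(k) for k in range(terms))


t = 0.3
closed_first = math.exp(t) * math.exp(math.exp(t))
closed_second = (math.exp(t) + math.exp(2.0 * t)) * math.exp(math.exp(t))
h = 1e-5
difference_quotient = (eta_derivative(0, t + h) - eta_derivative(0, t - h)) / (2.0 * h)

print(eta_derivative(1, t), closed_first, difference_quotient)
print(eta_derivative(2, t), closed_second)
```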


Notes to Appendix A. Perhaps the most classical result on interchange of integral and derivative is attributed to Leibniz: if $f$ and $\partial f/\partial y$ are continuous on a finite rectangle $[a,b]\times[c,d]$ and $c < y < d$, then $(d/dy)\int_a^b f(x,y)\,dx = \int_a^b\partial f(x,y)/\partial y\,dx$.

In the literature of this topic, “uniformly integrable” and even the existence of integrals are sometimes used with different meanings. Let $\mathcal{F}$ be a family of real-valued continuous functions defined on a half-line, say $[1,\infty)$. A function $f$ in $\mathcal{F}$ will be said to have an (improper Riemann) integral if the integrals $\int_1^M f(x)\,dx$ converge to a finite limit as $M \to \infty$, which may then be called $\int_1^\infty f(x)\,dx$. This integral is analogous to the (conditional) convergence of the sum of a series, where Lebesgue integrability corresponds to absolute convergence. An example of a non-Lebesgue integrable function having a finite improper Riemann integral is $(\sin x)/x$. The family $\mathcal{F}$ is called uniformly (improper Riemann) integrable if the convergence is uniform for $f \in \mathcal{F}$. For a continuous function $f$ of two real variables having a continuous partial derivative $\partial f/\partial y$, uniform improper Riemann integrability with respect to $x$, say on a half-line $[a,\infty)$, of $\partial f(x,y)/\partial y$ for $y$ in some interval, and finiteness of $\int_a^\infty f(x,y)\,dx$, imply that the latter integral can be differentiated with respect to $y$ under the integral sign, e.g., Ilyin and Poznyak (1982, Theorem 9.10).

Ilyin and Poznyak (1982, Theorem 9.8) prove what they call Dini's test: if $f$ is nonnegative and continuous on $[a,\infty)\times[c,d]$, where $-\infty < c < d < +\infty$, and for each $y \in [c,d]$, $I(y) := \int_a^\infty f(x,y)\,dx < \infty$ and $I(\cdot)$ is continuous, then the $f(\cdot,y)$ for $c \le y \le d$ are uniformly (improper Riemann) integrable; in this case they are also uniformly integrable in our sense. Of course, the same holds if $f$ is nonpositive, which would apply to the derivative of a positive, decreasing function.

The Weierstrass domination condition, Corollary A.3, is classical and 
appears in most of the listed references. Hobson (1926, p. 355) states it (for 
functions of real variables) and gives references to several works published 
between 1891 and 1910.

Lang (1983) mentions the convolution case, Corollary A.6. Lang also treats differentiation of integrals $\int_a^b f(t,x)\,dt$ where $x \in E$, $f \in F$, and $E$, $F$ are Banach spaces.

Ilyin and Poznyak (1982, Theorem 9.5) and Kartashev and Rozhdestvenskii (1984) treat differentiation of integrals $\int_{a(y)}^{b(y)} f(x,y)\,dx$.

Brown (1986, Chapter 2) is a reference for Proposition A.16 and some multidimensional extensions, for application to exponential families in statistics.


Appendix B


Multinomial Distributions


Consider an experiment whose outcome is given by a point $\omega$ of a set $\Omega$ where $(\Omega,\mathcal{A},P)$ is a probability space. Let $A_i$, $i = 1,\dots,m$, be disjoint measurable sets with union $\Omega$. Independent repetition of the experiment $n$ times is represented by taking the Cartesian product $\Omega^n$ of $n$ copies of $(\Omega,\mathcal{A},P)$ (RAP, Theorem 4.4.6). Let $X := \{X_i\}_{i=1}^n \in \Omega^n$ and $P_n(A) := \frac{1}{n}\sum_{i=1}^n\delta_{X_i}(A)$, $A \in \mathcal{A}$, so $P_n$ is an empirical measure for $P$, and $nP_n(A_i)$ is the number of times the event $A_i$ occurs in the $n$ repetitions.

A random vector $\{n_j\}_{j=1}^m$ is said to have a multinomial distribution for $n$ observations, or with sample size $n$, and $m$ bins or categories, with probabilities $\{p_j\}_{j=1}^m$ if $p_j \ge 0$, $p_1 + \cdots + p_m = 1$, and for any nonnegative integers $k_j$ with $k_1 + \cdots + k_m = n$,

\[
\Pr\{n_j = k_j,\ j = 1,\dots,m\} = \frac{n!}{k_1!k_2!\cdots k_m!}\,p_1^{k_1}p_2^{k_2}\cdots p_m^{k_m}. \tag{B.1}
\]

Otherwise (if the $k_j$ are not nonnegative integers, or do not sum to $n$) the probabilities are 0.
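Formula (B.1) translates directly into code. The following is a minimal sketch in exact rational arithmetic; the function name `multinomial_pmf` and the particular probabilities are illustrative assumptions. It also checks that the probabilities over all count vectors sum to 1, as the multinomial theorem requires.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def multinomial_pmf(n, counts, probs):
    """Pr{n_j = k_j for all j} as in (B.1); zero unless the k_j are nonnegative and sum to n."""
    if any(k < 0 for k in counts) or sum(counts) != n:
        return Fraction(0)
    value = Fraction(factorial(n))
    for k, p in zip(counts, probs):
        value *= Fraction(p) ** k / factorial(k)
    return value


probs = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))  # illustrative values
n = 4
total = sum(multinomial_pmf(n, ks, probs) for ks in product(range(n + 1), repeat=3))
print(multinomial_pmf(n, (2, 1, 1), probs), total)
```

Using `Fraction` rather than floats keeps the check of the total exact.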

Note that in a multinomial distribution, since the probability is only positive when $k_m = n - \sum_{i<m}k_i$, we have $n_m = n - \sum_{i<m}n_i$. Also, $p_m = 1 - \sum_{i<m}p_i$. So if $p_1,\dots,p_{m-1}$ are nonnegative and their sum is less than or equal to 1, it also makes sense to say that $\{n_j\}_{j<m}$ have a multinomial distribution for $m$ (not $m-1$) bins and probabilities $\{p_j\}_{j<m}$ if $\Pr\{n_j = k_j,\ j < m\}$ is given by the right side of (B.1) whenever $k_j$ are nonnegative integers whose sum is at most $n$, and where $k_m$ and $p_m$ are determined as just described.


Theorem B.1 For any probability space $(\Omega,\mathcal{A},P)$ and any disjoint measurable sets $A_i$, $i = 1,\dots,m$, with union $\Omega$, $\{nP_n(A_i)\}_{i=1}^m$ have a multinomial distribution for $n$ observations and $m$ categories with probabilities $\{P(A_i)\}_{i=1}^m$. The same holds if we take $i = 1,\dots,m-1$.

Proof. Use induction on $m$. For $m = 2$, $nP_n(A_1)$ has a binomial distribution: specifically, $nP_n(A_1)$ is the number of successes, and $nP_n(A_2)$ the number of failures, in $n$ independent trials with probability $p = P(A_1)$ of success on each trial. To evaluate $\Pr(nP_n(A_1) = k)$, for $k = 0,\dots,n$, there are $\binom{n}{k}$ ways to choose $k$ of the $n$ trials. For each way, the probability that just these trials are successes is $p^k(1-p)^{n-k}$. Then $nP_n(A_2) = n - k$, and the stated result holds.

For the induction step from $m$ to $m+1$, apply the case of $m$ bins to $A_1 \cup A_2, A_3,\dots,A_{m+1}$. Then, given that $nP_n(A_1 \cup A_2) = k_1 + k_2$, the conditional probability that $n_1 = k_1$ and $n_2 = k_2$ is binomial. Checking that

\[
\frac{(p_1+p_2)^{k_1+k_2}}{(k_1+k_2)!}\binom{k_1+k_2}{k_1}\Bigl(\frac{p_1}{p_1+p_2}\Bigr)^{k_1}\Bigl(\frac{p_2}{p_1+p_2}\Bigr)^{k_2} = \frac{p_1^{k_1}p_2^{k_2}}{k_1!\,k_2!}
\]

finishes the proof.


Theorem B.2 Suppose $\{n_j\}_{j=1}^m$ have a multinomial distribution for $n$ observations and $m$ bins with probabilities $\{p_j\}_{j=1}^m$. Let $T$ be a subset of $\{1,\dots,m\}$. Then

(a) $\{n_j\}_{j\in T}$ have a multinomial distribution for $n$ observations and card$(T)+1$ bins, with probabilities $\{p_j\}_{j\in T}$.

(b) Let $S$ be a subset of $\{1,\dots,m\}$ disjoint from $T$. Then the conditional distribution, or law, $\mathcal{L}(\{n_i\}_{i\in S} \mid \{n_j\}_{j\in T})$, depends on $\{n_j\}_{j\in T}$ only through $\sum_{j\in T}n_j$, in other words for any integers $k_j \ge 0$, $j \in T$,

\[
\mathcal{L}\bigl(\{n_i\}_{i\in S} \bigm| n_j = k_j \text{ for all } j \in T\bigr) = \mathcal{L}\Bigl(\{n_i\}_{i\in S} \Bigm| \sum_{j\in T}n_j = \sum_{j\in T}k_j\Bigr).
\]

(c) If $S \cup T = \{1,\dots,m\}$, this distribution is multinomial for $n - \sum_{j\in T}k_j$ observations in card$(S)$ bins with probabilities $p_j/\sum_{i\in S}p_i$, $j \in S$.
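Parts (a) and (c) of Theorem B.2 can be checked by exact enumeration in a small case. The sketch below assumes $m = 4$, $n = 5$, $S = \{1,2\}$, $T = \{3,4\}$ and arbitrary illustrative probabilities; it is a finite verification of one instance, not a substitute for the proof. In part (a) the extra bin (beyond those in $T$) carries the remaining probability $p_1 + p_2$.

```python
from fractions import Fraction
from math import factorial


def pmf(counts, probs):
    """Right side of (B.1) with n = sum(counts), in exact rational arithmetic."""
    value = Fraction(factorial(sum(counts)))
    for k, p in zip(counts, probs):
        value *= Fraction(p) ** k / factorial(k)
    return value


n = 5
p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]  # illustrative values
p_S = p[0] + p[1]  # S = {1, 2}, T = {3, 4} in the notation of Theorem B.2

checked = 0
for k3 in range(n + 1):
    for k4 in range(n + 1 - k3):
        rest = n - k3 - k4
        # Marginal law of (n_3, n_4): sum the joint pmf over k1 + k2 = rest.
        marginal = sum(pmf((k1, rest - k1, k3, k4), p) for k1 in range(rest + 1))
        # Part (a): multinomial with card(T) + 1 = 3 bins.
        assert marginal == pmf((rest, k3, k4), (p_S, p[2], p[3]))
        for k1 in range(rest + 1):
            conditional = pmf((k1, rest - k1, k3, k4), p) / marginal
            # Part (c): multinomial for `rest` observations with probabilities p_i / (p_1 + p_2).
            assert conditional == pmf((k1, rest - k1), (p[0] / p_S, p[1] / p_S))
            checked += 1
print("conditional probabilities checked:", checked)
```

The conditional probabilities depend on $(k_3, k_4)$ only through `rest`, which is the content of part (b) in this instance.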


Proof. (a) We can assume that for some r, T = {r + 1, . . . , m}. The probability 
that n; = kj forr < j <m is a sum of multinomial probabilities, 


k; ; 
n! (Tpk) D Tey iets Kj K = Ki +-++ +k, =n—) kj 


j>r 
where «j are integers > 0. The sum ) (II...) equals (pi +--+ + pr) /K! by 
the multinomial theorem, and (a) follows. 


(b) We can assume $S \cup T = \{1, \ldots, m\}$ since the joint distribution of $n_i$, $i \in S$,
is determined by that of $n_j$, $j \notin T$. Thus $S = \{1, \ldots, r\}$. We need to evaluate
conditional probabilities of the form

$$\Pr(n_i = k_i,\ i = 1, \ldots, r \mid n_j = k_j,\ j > r)
= \frac{P(n_i = k_i,\ i = 1, \ldots, m)}{P(n_j = k_j,\ j > r)}.$$

The numerator is a simple multinomial probability as in (B.1). The denominator
is a multinomial probability by part (a). Carrying out the division, the $p_j^{k_j} / k_j!$
terms for $j > r$ cancel. The resulting expression, as claimed, depends on the
$k_j$ for $j > r$ only through their sum. Also,


(c) It is of the stated multinomial form, by the proof of (a).
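
Parts (b) and (c) can be checked numerically on a small case. The following is only an illustrative sketch: the bin probabilities, the sample size, and the helper names `multinomial_pmf` and `conditional_law` are assumptions chosen for the example, not part of the text. It uses exact rational arithmetic, so the comparisons are equalities rather than approximations.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def multinomial_pmf(counts, probs):
    """Exact multinomial probability n! * prod_j p_j^{k_j} / k_j!."""
    value = Fraction(factorial(sum(counts)))
    for k, p in zip(counts, probs):
        value *= Fraction(p) ** k / factorial(k)
    return value


def conditional_law(n, probs, S, T, k_T):
    """Law of (n_i, i in S) given n_j = k_T[j] for j in T.

    S and T are assumed to partition the bin indices, as in part (c).
    """
    rest = n - sum(k_T.values())
    joint = {}
    for k_S in product(range(rest + 1), repeat=len(S)):
        if sum(k_S) != rest:
            continue
        counts = [0] * len(probs)
        for i, k in zip(S, k_S):
            counts[i] = k
        for j, k in k_T.items():
            counts[j] = k
        joint[k_S] = multinomial_pmf(counts, probs)
    total = sum(joint.values())
    return {k_S: v / total for k_S, v in joint.items()}


# Hypothetical example: 4 bins, n = 6, S = first two bins, T = last two bins.
probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
S, T = [0, 1], [2, 3]

# Two different conditioning events with the same total count in T.
law_a = conditional_law(6, probs, S, T, {2: 1, 3: 2})
law_b = conditional_law(6, probs, S, T, {2: 3, 3: 0})

# Part (c): multinomial for n - sum k_j observations with renormalized p's.
p_S = [probs[i] / sum(probs[i] for i in S) for i in S]
law_c = {k_S: multinomial_pmf(k_S, p_S) for k_S in law_a}
```

In this sketch `law_a` and `law_b` condition on different values of the individual counts in $T$ but on the same sum, and both agree with the renormalized multinomial law `law_c`.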
Appendix C


Measures on Nonseparable Metric Spaces 


Let (S,d) be a metric space. Under fairly general conditions, to be given, 
any probability measure on the Borel sets will be concentrated in a separable 
subspace. 

The problem will be reduced to a problem about discrete spaces. An open
cover of $S$ will be a family $\{U_\alpha\}_{\alpha \in I}$ of open subsets $U_\alpha$ of $S$,
where $I$ is any set, here called an index set, such that $S = \bigcup_{\alpha \in I} U_\alpha$.
An open cover $\{V_\beta\}_{\beta \in J}$ of $S$, with some index set $J$, will be called a
refinement of $\{U_\alpha\}_{\alpha \in I}$ iff for all $\beta \in J$ there exists an
$\alpha \in I$ with $V_\beta \subset U_\alpha$. An open cover $\{V_\alpha\}_{\alpha \in I}$ will be
called $\sigma$-discrete if $I$ is the union of a sequence of sets $I_n$ such that for each $n$
and $\alpha \ne \beta$ in $I_n$, $V_\alpha$ and $V_\beta$ are disjoint. Recall that the ball
$B(x, r)$ is defined as $\{y \in S : d(x, y) < r\}$. For two sets $A, B$ we have
$d(A, B) := \inf\{d(x, y) : x \in A,\ y \in B\}$, and $d(y, B) := d(\{y\}, B)$ for a point $y$.


Theorem C.1 For any metric space $(S, d)$, any open cover $\{U_\alpha\}_{\alpha \in I}$ of $S$ has
an open $\sigma$-discrete refinement.


Proof. For any open set $U \subset S$ let $U_n := \{y : B(y, 2^{-n}) \subset U\}$. Then
$y \in U_n$ if and only if $d(y, S \setminus U) \ge 2^{-n}$, and
$d(U_n, S \setminus U_{n+1}) \ge 2^{-n} - 2^{-n-1} = 2^{-n-1}$
by the triangle inequality. (The sets $U_n$ are always closed, and not usually open.)
Let $U_{\alpha,n} := (U_\alpha)_n$. By well-ordering (e.g., RAP, Section 1.5), we can assume
that the index set $I$ is well-ordered by a relation $<$. For each positive integer
$n$ and each $\alpha \in I$ let
$W_{\alpha,n} := U_{\alpha,n} \setminus \bigcup\{U_{\beta,n+1} : \beta < \alpha\}$.
For each $\alpha \ne \beta$ in $I$ and each $n$, either
$W_{\alpha,n} \subset S \setminus U_{\beta,n+1}$ if $\beta < \alpha$, or
$W_{\beta,n} \subset S \setminus U_{\alpha,n+1}$ if $\alpha < \beta$.
In either case, $d(W_{\alpha,n}, W_{\beta,n}) \ge 2^{-n-1}$. Let
$V_{\alpha,n} := \{x : d(x, W_{\alpha,n}) < 2^{-n-3}\}$.
Then $d(V_{\alpha,n}, V_{\beta,n}) \ge 2^{-n-2}$. The sets $V_{\alpha,n}$ are all open. For a
given $n$ and different values of $\alpha$, they are disjoint. Let $J := \mathbb{N} \times I$ and
define $V_\gamma$, $\gamma \in J$, by $V_{(n,\alpha)} := V_{\alpha,n}$. To show that
$\{V_\gamma\}_{\gamma \in J}$ is a cover of $S$, given any $x \in S$, take
the least $\alpha$ such that $x \in U_\alpha$. Then for $n$ large enough, $x \in U_{\alpha,n}$,
and then $x \in W_{\alpha,n} \subset V_{\alpha,n}$ as desired. So $\{V_\gamma\}_{\gamma \in J}$ is a
$\sigma$-discrete refinement of $\{U_\alpha\}_{\alpha \in I}$,
completing the proof.


Cardinal numbers are defined in RAP, in the last part of Appendix A (as
smallest ordinals with given cardinality). A cardinal number $\zeta$ is said to be
measurable if for a set $S$ of cardinality $\zeta$, there exists a probability measure $P$
defined on all subsets of $S$ which is nonatomic, in other words $P(\{x\}) = 0$ for
all $x \in S$. If there is no such $P$, $\zeta$ is said to be of measure 0.

The continuum hypothesis implies that the cardinality c of the continuum 
(that is, of [0, 1]) is of measure 0 (RAP, Appendix C). 

The separability character of a metric space is the smallest cardinality of a 
dense subset. We have: 


Theorem C.2 Let $(S, d)$ be a metric space. Let $P$ be a probability measure on
the $\sigma$-algebra of Borel sets, generated by the open sets. Then either there is a
separable subspace $T$ with $P(T) = 1$, or the separability character $\zeta$ of $S$ is
measurable.


Proof. Let $\{x_\alpha\}_{\alpha \in I}$ be dense in $S$, where $I$ has cardinality $\zeta$. For a
fixed positive integer $m$, the balls $U_\alpha := B(x_\alpha, 1/m)$ form an open cover of $S$.
Take an open, $\sigma$-discrete refinement. Then $S$ is the union of countably many open sets
$V_n$, each of which is the union of open sets $V_\gamma$, $\gamma \in \Gamma_n$, disjoint for
different values of $\gamma$, and each included in some ball of radius $2/m$. Here each
$\Gamma_n$ has cardinality at most $\zeta$. If a cardinal has measure 0, then so, clearly, do
all smaller cardinals. Define a measure $\mu_n$ on each $\Gamma_n$ by
$\mu_n(A) := P\bigl(\bigcup_{\gamma \in A} V_\gamma\bigr)$, where
$P$ is defined on all open sets. So if $\zeta$ has measure 0, then for each $n$, there
is a countable subset $G_n$ of $\Gamma_n$ such that
$P(V_n) = \sum\{P(V_\gamma) : \gamma \in G_n\}$. So
$P(C_m) = 1$ where $C_m$ is a countable union (over $n$ and members of $G_n$ for each
$n$) of balls of radius $2/m$. Taking an intersection over $m$, we see that $P(C) = 1$
for some separable subset $C$.


It is consistent with the usual axioms of set theory (including the axiom of 
choice) that there are no measurable cardinals, in other words, all cardinals are 
of measure 0 (e.g., Drake 1974, pp. 67—68, 177—178). It is apparently unknown 
whether existence of measurable cardinals is consistent (Drake, 1974, pp. 185- 
186). So, for practical purposes, a probability measure defined on the Borel 
sets of a metric space is always concentrated in some separable subspace. 

Here is another fact giving separability: 


Theorem C.3 Let f be a Borel measurable function from a separable metric 
space S into a metric space T. Then, assuming the continuum hypothesis, f 
has separable range. 
Proof. If the range of $f$ is nonseparable, then for some $r > 0$, there is an
uncountable set $F$ of values of $f$ any two of which are more than $r$ apart. The
continuum hypothesis implies that $F$ has cardinality at least $c$. All the $2^c$ or
more subsets of $F$ are closed, so all their inverse images under $f$ are distinct
Borel sets. On the other hand, in a separable metric space there are at most 
c Borel sets: the larger collection of analytic sets has cardinality at most c 
by the Borel isomorphism theorem and universal analytic set theorem (RAP, 
Theorems 13.1.1 and 13.2.4). So there is a contradiction. 


Notes on Appendix C. Theorem C.1 on $\sigma$-discrete refinements is from Kelley
(1955, Theorem 4.21 p. 129), who attributes it to A. H. Stone (1948). Theorem 
C.2 is due to Marczewski and Sikorski (1948), who prove it by a somewhat 
different method. 
Appendix D


An Extension of Lusin’s Theorem 


Lusin’s theorem says that for any measurable real-valued function $f$, on $[0, 1]$
with Lebesgue measure $\lambda$ for example, and $\varepsilon > 0$, there is a set $A$ with
$\lambda(A) < \varepsilon$ such that restricted to the complement of $A$, $f$ is continuous.
Here $[0, 1]$ can be replaced by any normal topological space and $\lambda$ by any finite
measure $\mu$ which is closed regular, meaning that for each Borel measurable set $B$,
$\mu(B) = \sup\{\mu(F) : F \text{ closed},\ F \subset B\}$ (RAP, Theorem 7.5.2). Recall that
any finite Borel measure on a metric space is closed regular (RAP, Theorem 7.1.3).

Proofs of Lusin’s theorem are often based on Egorov’s theorem (RAP, Theorem 7.5.1),
which says that if measurable functions $f_n$ from a finite measure
space to a metric space converge pointwise, then for any $\varepsilon > 0$ there is a set of
measure less than $\varepsilon$ outside of which the $f_n$ converge uniformly.

Here, the aim will be to extend Lusin’s theorem to functions having values in 
any separable metric space. The proof of Lusin’s theorem in RAP, however, also 
relied on the Tietze—Urysohn extension theorem, which says that a continuous 
real-valued function on a closed subset of a normal space can be extended to 
be continuous on the whole space. Such an extension may not exist for some 
range spaces: for example, the identity from {0, 1} onto itself does not extend to 
a continuous function from [0, 1] onto {0, 1}; in fact, there is no such function 
since [0, 1] is connected. 

It turns out, however, that the Tietze—Urysohn extension and Egorov’s the- 
orem are both unnecessary in proving Lusin’s theorem: 


Theorem D.1 Let $(X, \mathcal{T})$ be a topological space and $\mu$ a finite, closed regular
measure defined on the Borel sets of $X$. Let $f$ be a Borel measurable function
from $X$ into $S$ where $(S, d)$ is a separable metric space. Then for any $\varepsilon > 0$ there
is a closed set $F$ with $\mu(X \setminus F) < \varepsilon$ such that $f$ restricted to $F$ is continuous.


Proof. Let $\{s_n\}_{n \ge 1}$ be a countable dense set in $S$. For $m = 1, 2, \ldots$, and any
$x \in X$, let $f_m(x) = s_n$ for the least $n$ such that $d(f(x), s_n) < 1/m$. Then $f_m$ is
measurable and defined on all of $X$. For each $m$, let $n(m)$ be large enough so
that

$$\mu\{x : d(f(x), s_n) \ge 1/m \text{ for all } n \le n(m)\} \le 1/2^m.$$

For $n = 1, \ldots, n(m)$, take a closed set $F_{mn} \subset f_m^{-1}\{s_n\}$ with

$$\mu\bigl(f_m^{-1}\{s_n\} \setminus F_{mn}\bigr) < \frac{1}{2^m n(m)}$$

by closed regularity. For each fixed $m$, the sets $F_{mn}$ are disjoint for different
values of $n$. Let $F_m := \bigcup_{n=1}^{n(m)} F_{mn}$. Then $f_m$ is continuous on $F_m$, being
constant on each of the finitely many disjoint closed sets $F_{mn}$. By choice of
$n(m)$ and $F_{mn}$, $\mu(X \setminus F_m) < 2/2^m$.

Since $d(f_m, f) < 1/m$ everywhere, clearly $f_m \to f$ uniformly (so Egorov’s
theorem is not needed). For $r = 1, 2, \ldots$, let $H_r := \bigcap_{m=r}^{\infty} F_m$. Then $H_r$ is
closed and $\mu(X \setminus H_r) < 4/2^r$. Take $r$ large enough so that $4/2^r < \varepsilon$. Then $f$
restricted to $H_r$ is continuous as the uniform limit of continuous functions $f_m$
on $H_r \subset F_m$, $m \ge r$, so we can let $F = H_r$ to finish the proof.


Notes on Appendix D. Lusin’s theorem for measurable functions with values in 
any separable metric space, on a space with a closed regular finite measure, was 
first proved as far as I know by Schaerf (1947), who proved it for f with values 
in any second-countable topological space (i.e. a space having a countable base 
for its topology), and for more general domain spaces (“neighborhood spaces”). 
See also Schaerf (1948) and Zakon (1965) for more extensions. 
Appendix E


Bochner and Pettis Integrals


Let $(X, \mathcal{A}, \mu)$ be a measure space and $(S, \|\cdot\|)$ a separable Banach space. A
function $f$ from $X$ into $S$ will be called simple or $\mu$-simple if it is of the form
$f = \sum_{i=1}^{k} 1_{A_i} y_i$ for some $y_i \in S$, $k < \infty$ and measurable $A_i$ with
$\mu(A_i) < \infty$. For a simple function, the Bochner integral is defined by

$$\int f \, d\mu := \sum_{i=1}^{k} \mu(A_i)\, y_i \in S.$$


Theorem E.1 The Bochner integral is well-defined for $\mu$-simple functions, the
$\mu$-simple functions form a real vector space, and for any $\mu$-simple functions
$f$, $g$ and real constant $c$, $\int cf + g \, d\mu = c \int f \, d\mu + \int g \, d\mu$.


Proof. These facts are proved just as they are for real-valued functions (RAP, 
Proposition 4.1.4). 


For any measurable function $g$ from $X$ into $S$, where measurability is defined
for the Borel $\sigma$-algebra generated by the open sets of $S$, $x \mapsto \|g(x)\|$ is a
measurable, nonnegative real-valued function on $X$.


Lemma E.2 For any two $\mu$-simple functions $f$ and $g$ from $X$ into $S$,
$\|\int f \, d\mu\| \le \int \|f\| \, d\mu$ and
$\|\int f \, d\mu - \int g \, d\mu\| \le \int \|f - g\| \, d\mu$.


Proof. Since $f - g$ is $\mu$-simple and Theorem E.1 applies, it will be enough
to prove the first statement. Note that $\int f \, d\mu$ is a Bochner integral while
$\int \|f\| \, d\mu$ is an integral of a real-valued function. Let $f = \sum_i 1_{A_i} u_i$, a finite
sum. We can assume that the measurable sets $A_i$ are disjoint. Then

$$\Bigl\|\int f \, d\mu\Bigr\| = \Bigl\|\sum_i \mu(A_i)\, u_i\Bigr\| \le \sum_i \mu(A_i)\, \|u_i\| = \int \|f\| \, d\mu.$$


Let $\mathcal{L}^1(X, \mathcal{A}, \mu, S) := \mathcal{L}^1(X, \mu, S)$ be the space of all
measurable functions $f$ from $X$ into $S$ such that $\int \|f\| \, d\mu < \infty$. By the triangle
inequality, it is easily seen that $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ is a vector space. Also,
all $\mu$-simple functions belong to $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$. Define $\|\cdot\|_1$ on
$\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ by $\|f\|_1 := \int \|f\| \, d\mu$. It is
easily seen that $\|\cdot\|_1$ is a seminorm on $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$. On the
vector space of $\mu$-simple functions, the Bochner integral is linear (by Theorem E.1) and
continuous (also Lipschitz) for $\|\cdot\|_1$ by Lemma E.2.


Theorem E.3 For any separable Banach space $(S, \|\cdot\|)$ and any measure space
$(X, \mathcal{A}, \mu)$, the $\mu$-simple functions are dense for $\|\cdot\|_1$ in
$\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ and the Bochner integral extends uniquely to a linear
function on $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ with values in the Banach space $S$,
continuous for $\|\cdot\|_1$.


Proof. A first step in the proof will be the following. 


Lemma E.4 For any $f \in \mathcal{L}^1(X, \mathcal{A}, \mu, S)$ there exist $\mu$-simple $f_k$ with
$\|f - f_k\|_1 \to 0$ and $\|f - f_k\| \to 0$ almost everywhere for $\mu$ as $k \to \infty$ and such
that $\|f_k\| \le \|f\|$ on $X$.


Proof. Define a measure $\gamma$ on $(X, \mathcal{A})$ by $\gamma(B) := \int_B \|f(x)\| \, d\mu(x)$.
Then $\gamma$ is a finite measure and has a finite image measure $\gamma \circ f^{-1}$ on $S$. If
$f = 0$ a.e. ($\mu$), so that $\gamma \equiv 0$, set $f_k \equiv 0$. So we can assume
$\gamma(X) > 0$. Given $m = 1, 2, \ldots$, by Ulam’s theorem (RAP, Theorem 7.1.4), there is a
compact $K := K_m \subset S$ with $\gamma(f^{-1}(S \setminus K)) < 1/2^m$. Then take a finite set
$F \subset K$ such that each point of $K$ is within $1/m$ of some point of $F$. For each
$y \in F$, let $s(y) := cy$ for a constant $c$ with $0 < c < 1$ such that $\|y - cy\| = 1/m$, or
$s(y) = 0$ if there is no such $c$ ($\|y\| \le 1/m$). Let $F := \{y_1, \ldots, y_k\}$ for some
distinct $y_i$. We can assume $y_1 := 0 \in F \subset K$. For $y \in K$ let $h_m(y) = s(y_i)$ such
that $\|y - y_i\|$ is minimized, choosing the smallest possible index $i$ in case of ties. For
$y \notin K$ set $h_m(y) := 0$. Then for each $y \in S$, $\|h_m(y)\| \le \|y\|$, and
$\|h_m(y) - y\| \le 2/m$ for all $y \in K$. Define a function $g_m$ by $g_m(x) := h_m(f(x))$.
Then $g_m$ and $h_m$ are measurable and have finite range. Let $\delta := \min_{i \ge 2} \|y_i\|$.
Then $\delta > 0$. The set where $g_m \ne 0$ and thus $f \in K$ is included in the set where
$\|f\| > \delta/2$, which has finite measure for $\mu$. So $g_m$ is a $\mu$-simple function. The
set where $\|f - g_m\| > 2/m$ has $\gamma$ measure at most $1/2^m$. Then, by the Borel–Cantelli
Lemma applied to the probability measure $\gamma/\gamma(X)$, it follows that $g_m \to f$
almost everywhere for $\gamma$. For any $x$, if $f(x) = 0$, then $g_m(x) = 0$ for all $m$. It
follows that $\|g_m - f\| \to 0$ almost everywhere for $\mu$. Since $\|g_m\| \le \|f\|$ and
$\|g_m - f\| \le 2\|f\|$, it follows by dominated convergence that $\|g_m - f\|_1 \to 0$
as $m \to \infty$, proving the Lemma.


Then to continue proving the Theorem, the Bochner integral, a linear function,
is continuous (and Lipschitz) for the seminorm $\|\cdot\|_1$. Thus it is uniformly
continuous on its original domain, the simple functions, and so has a unique
continuous extension to the closure of its domain, which is
$\mathcal{L}^1(X, \mathcal{A}, \mu, S)$, and the extension is clearly also linear.

So the Bochner integral is well-defined for all functions in
$\mathcal{L}^1(X, \mathcal{A}, \mu, S)$, and only for such functions. A function from $X$ into
$S$ will be called Bochner integrable if and only if it belongs to
$\mathcal{L}^1(X, \mathcal{A}, \mu, S)$. The extension of the Bochner integral to
$\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ will also be written as $\int \cdot \, d\mu$. Thus Theorem
E.3 implies that

$$\int cf + g \, d\mu = c \int f \, d\mu + \int g \, d\mu \qquad \text{(E.1)}$$

for any Bochner integrable functions $f$, $g$ and real constant $c$. Also, by taking
limits in Lemma E.2 it follows that

$$\Bigl\|\int f \, d\mu\Bigr\| \le \int \|f\| \, d\mu \qquad \text{(E.2)}$$

for any Bochner integrable function $f$.
Although monotone convergence is not defined in general Banach spaces, a 
form of dominated convergence holds: 


Theorem E.5 Let $(X, \mathcal{A}, \mu)$ be a measure space. Let $f_n$ be measurable functions
from $X$ into a Banach space $S$ such that for all $n$, $\|f_n\| \le g$ where
$g$ is an integrable real-valued function. Suppose $f_n$ converge almost everywhere
to a function $f$. Then $f$ is Bochner integrable and
$\|\int f_n - f \, d\mu\| \le \int \|f_n - f\| \, d\mu \to 0$ as $n \to \infty$.


Proof. First, $f$ is measurable (RAP, Theorem 4.2.2). Since $\|f\| \le g$, $f$ is
Bochner integrable, and the rest follows from (E.2) for $f_n - f$ and dominated
convergence for real-valued functions $\|f_n - f\| \le 2g$.


A Bochner integral $\int g \, d\mu = \int f \, d\mu$ can be defined when $g$ is only defined
almost everywhere for $\mu$, $f$ is Bochner integrable and $g = f$ where $g$ is defined,
just as for real-valued functions (RAP, Section 4.3). It is easy to check that when
$S = \mathbb{R}$, the Bochner integral equals the usual Lebesgue integral.

A Tonelli—Fubini theorem holds for the Bochner integral: 


Theorem E.6 Let $(X, \mathcal{A}, \mu)$ and $(Y, \mathcal{B}, \nu)$ be $\sigma$-finite measure
spaces. Let $f$ be a measurable function from $X \times Y$ into a Banach space $S$ such that

$$\iint \|f(x, y)\| \, d\mu(x) \, d\nu(y) < \infty.$$

Then for $\mu$-almost all $x$, $f(x, \cdot)$ is Bochner integrable from $Y$ into $S$; for
$\nu$-almost all $y$, $f(\cdot, y)$ is Bochner integrable from $X$ into $S$, and

$$\int f \, d(\mu \times \nu) = \iint f(x, y) \, d\mu(x) \, d\nu(y) = \iint f(x, y) \, d\nu(y) \, d\mu(x).$$


Proof. By the usual Tonelli–Fubini theorem for real-valued functions (RAP,
4.4.5), the assumption $\iint \|f(x, y)\| \, d\mu(x) \, d\nu(y) < \infty$ implies that for
$\nu$-almost all $y$, $\int \|f(x, y)\| \, d\mu(x) < \infty$ and for $\mu$-almost all $x$,
$\int \|f(x, y)\| \, d\nu(y) < \infty$.
Let $TF$ be the set of all $f \in \mathcal{L}^1(X \times Y, \mu \times \nu, S)$ such that

$$\int f \, d(\mu \times \nu) = \iint f(x, y) \, d\mu(x) \, d\nu(y) = \iint f(x, y) \, d\nu(y) \, d\mu(x).$$

Then by the usual Tonelli–Fubini theorem, all $(\mu \times \nu)$-simple functions
are in $TF$. For any $f \in \mathcal{L}^1(X \times Y, \mu \times \nu, S)$, by Lemma E.4, take
simple $f_n$ with $\|f_n\| \le \|f\|$ and $\|f_n - f\| \to 0$ almost everywhere for
$\mu \times \nu$. Then for $\nu$-almost all $y$, $f_n(x, y) \to f(x, y)$ for $\mu$-almost all $x$, and
$\|f_n(\cdot, y)\| \le \|f(\cdot, y)\| \in \mathcal{L}^1(X, \mu)$, so by dominated convergence
(Theorem E.5), $\int f_n(x, y) \, d\mu(x) \to \int f(x, y) \, d\mu(x)$. Since the functions of $y$ on
the left are $\nu$-measurable, so is the function on the right (RAP, Theorem 4.2.2, with a
possible adjustment on a set of measure 0). For each $y$,
$\|\int f_n(x, y) \, d\mu(x)\| \le \int \|f(x, y)\| \, d\mu(x)$, which is $\nu$-integrable with
respect to $y$, so by dominated convergence (Theorem E.5) again,

$$\iint f_n(x, y) \, d\mu(x) \, d\nu(y) \to \iint f(x, y) \, d\mu(x) \, d\nu(y) \quad \text{as } n \to \infty.$$

The same holds for the iterated integral in the other order, $\iint \cdot \, d\nu \, d\mu$, and
$\int f_n \, d(\mu \times \nu) \to \int f \, d(\mu \times \nu)$ by (E.2), so $f \in TF$.


Now let $(S, \mathcal{T})$ be any topological vector space, in other words, $S$ is a real
vector space, $\mathcal{T}$ is a topology on $S$, and the operation $(c, f, g) \mapsto cf + g$ is
jointly continuous from $\mathbb{R} \times S \times S$ into $S$. Then the dual space $S'$ is the set
of all continuous linear functions from $S$ into $\mathbb{R}$. Let $(X, \mathcal{A}, \mu)$ be a measure
space. Then a function $f$ from $X$ into $S$ is called Pettis integrable with Pettis
integral $y \in S$ if and only if for every $t \in S'$, $\int t(f) \, d\mu$ is defined and finite
and equals $t(y)$.

The Pettis integral is also due to Gelfand and Dunford and might be called
the Gelfand–Dunford–Pettis integral (see the Notes). The Pettis integral may
lack interest unless $S'$ separates points of $S$, as is true for normed linear spaces
by the Hahn–Banach theorem (RAP, Corollary 6.1.5).


Theorem E.7 For any measure space $(X, \mathcal{A}, \mu)$ and separable Banach space
$(S, \|\cdot\|)$, each Bochner integrable function $f$ from $X$ into $S$ is also Pettis
integrable, and the values of the integrals are the same.


Proof. The equation $t(\int f \, d\mu) = \int t(f) \, d\mu$ is easily seen to hold for simple
functions and then, by Theorem E.3, for Bochner integrable functions.


Example. A function can be Pettis integrable without being Bochner integrable.
Let $H$ be a separable, infinite-dimensional Hilbert space with orthonormal
basis $\{e_n\}$. Let $\mu(f = n e_n) := \mu(f = -n e_n) := p_n := n^{-7/4}$, and $f = 0$
otherwise. Then $\int \|f\| \, d\mu = 2 \sum_n n^{-3/4} = +\infty$, so $f$ is not Bochner
integrable. On the other hand for any $x_n$ with $\sum_n x_n^2 < \infty$ we have

$$\Bigl|\sum_n 2 n x_n p_n\Bigr| \le \Bigl(\sum_n x_n^2\Bigr)^{1/2} \, 2 \Bigl(\sum_n n^{-3/2}\Bigr)^{1/2} < \infty,$$

and by symmetry the Pettis integral of $f$ is 0.
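
The two series in this example behave very differently, which a short numerical sketch can make visible. The function names and the test sequence $x_n = 1/n$ below are assumptions chosen for illustration only; the code simply evaluates partial sums and compares them with the Cauchy–Schwarz bound from the example.

```python
import math


def p(n):
    """Mass p_n = n^(-7/4) placed at each of the two points +n*e_n and -n*e_n."""
    return n ** -1.75


def bochner_partial(N):
    """Partial sum of int ||f|| dmu = sum_n 2 * n * p_n = 2 * sum_n n^(-3/4)."""
    return sum(2 * n * p(n) for n in range(1, N + 1))


def pettis_partial(x, N):
    """Partial sum of int |<x, f>| dmu = sum_n 2 * n * |x_n| * p_n."""
    return sum(2 * n * abs(x(n)) * p(n) for n in range(1, N + 1))


def cauchy_schwarz_bound(x, N):
    """(sum x_n^2)^(1/2) * 2 * (sum n^(-3/2))^(1/2), truncated at N terms."""
    norm_x = math.sqrt(sum(x(n) ** 2 for n in range(1, N + 1)))
    return 2 * norm_x * math.sqrt(sum(n ** -1.5 for n in range(1, N + 1)))


def x(n):
    """Hypothetical square-summable coefficient sequence x_n = 1/n."""
    return 1.0 / n


for N in (10, 1000, 100000):
    print(N, bochner_partial(N), pettis_partial(x, N), cauchy_schwarz_bound(x, N))
```

The first printed column of sums keeps growing (roughly like $8 N^{1/4}$), while the second stays below the third, which itself stays bounded as $N$ increases.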


Example. The Tonelli–Fubini theorem, which holds for the Bochner integral
(Theorem E.6), can fail for the Pettis integral. Let $H$ be an infinite-dimensional
Hilbert space and let $e_{ij}$ be orthonormal for all positive integers $i$ and $j$.
Let $U(x, y) := 2^i e_{ij}$ for $(j - 1)/2^i \le x < j/2^i$ and $2^{-i} \le y < 2^{1-i}$,
$j = 1, \ldots, 2^i$, and 0 elsewhere. Then it can be checked that $U$ is Pettis integrable
on $[0, 1] \times [0, 1]$ for Lebesgue measure but is not integrable with respect to $y$
for fixed $x$.


Notes on Appendix E. The treatment of the Bochner integral here is partly 
based on that of Cohn (1980). The following historical notes are mainly based 
on those of Dunford and Schwartz (1958). 

Graves (1927) defined a Riemann integral for suitable Banach-valued func- 
tions. Bochner (1933) defined his integral, a Lebesgue-type integral for Banach- 
valued functions. The definition of integral given by Pettis (1938) was published 
at the same time or earlier by Gelfand (1936, 1938) and is equivalent to one of 
the definitions of Dunford (1936). Gelfand (1936) is one of the first, perhaps 
the first, of many distinguished publications by I. M. Gelfand. Birkhoff’s inte- 
gral (1935) strictly includes that of Bochner and is strictly included in that of 
Gelfand—Dunford-Pettis. Price (1940), in a further extension, treats set-valued 
functions. Dunford and Schwartz (1958, Chapter 3) develop an integral when
$\mu$ is only finitely additive. Facts such as the dominated convergence theorem
still require countable additivity.


12:2 


P1: KpB 


CUUS2019-APP-F CUUS2019-Dudley 978 0 521 49884 5 September 24, 2013 


Trim: 6in x 9in Top: 0.5in Gutter: 0.664in 


Appendix F 


Nonexistence of Some Linear Forms 


Recall that a real vector space $V$ with a topology $\mathcal{T}$ is called a topological
vector space if addition is jointly continuous from $V \times V$ to $V$ for $\mathcal{T}$, and
scalar multiplication $(c, v) \mapsto cv$ is jointly continuous from $\mathbb{R} \times V$ into $V$ for
$\mathcal{T}$ on $V$ and the usual topology on $\mathbb{R}$. If $d$ is a metric on $V$, then $(V, d)$ is
called a metric linear space iff it is a topological vector space for the topology of $d$.
First, it will be shown that any measurable linear form on a complete metric
linear space is continuous. Then, it will be seen that the only measurable (and
thus, continuous) linear form on $L^p([0, 1], \lambda)$ for $0 < p < 1$ is zero. Recall that
for a given $\sigma$-algebra $\mathcal{A}$ of measurable sets, in our case the Borel $\sigma$-algebra,
a universally measurable set is one measurable for the completion of every
probability measure on $\mathcal{A}$ (RAP, Section 11.5). The universally measurable
sets form a $\sigma$-algebra, and a function $f$ is called universally measurable if for
every Borel set $B$ in its range, $f^{-1}(B)$ is universally measurable.


Theorem F.1 Let $(E, d)$ be a complete metric real linear space. Let $u$ be a
universally measurable linear form: $E \to \mathbb{R}$. Then $u$ is continuous.


Proof. There exists a metric $\rho$, metrizing the $d$ topology on $E$, such that
$\rho(x + v, y + v) = \rho(x, y)$ for all $x, y, v \in E$, and $|\lambda| \le 1$ implies
$\rho(0, \lambda x) \le \rho(0, x)$ for all $x \in E$ (Schaefer 1966, Theorem I.6.1 p. 28). Then
we have:


Lemma F.2 If $(E, d)$ is a complete metric linear space, then it is also complete
for $\rho$.


Proof. If not, let $F$ be the completion of $E$ for $\rho$. Because of the translation
invariance property of $\rho$, the topological vector space structure of $E$ extends to
$F$. Then since $(E, d)$ is complete, $E$ is a $G_\delta$ (a countable intersection of open
sets) in $F$ (RAP, Theorem 2.5.4). So, if $E$ is not complete for $\rho$, $F \setminus E$ is of
first category in $F$, but $F \setminus E$ includes a translate of $E$ which is carried to $E$
by a homeomorphism (by translation) of $F$, so $F \setminus E$ is of second category in
$F$, a contradiction, so the Lemma is proved.


Now to continue the proof of Theorem F.1, let $(E, \rho)$ be a complete metric
linear space where $\rho$ has the stated properties. If $u$ is not continuous, there are
$e_n \in E$ with $e_n \to 0$ and $|u(e_n)| \ge 1$ for all $n$. There are constants $c_n \to \infty$
such that $c_n e_n \to 0$: for each $k = 1, 2, \ldots$, there is a $\delta_k > 0$ such that if
$\rho(x, 0) < \delta_k$, then $\rho(kx, 0) < 1/k$. We can assume that $\delta_k$ is decreasing as $k$
increases. For some $n_k$, $\rho(e_n, 0) < \delta_k$ for $n \ge n_k$. We can also assume that $n_k$
increases with $k$. Let $c_n = 1$ for $n < n_1$ and $c_n = k$ for $n_k \le n < n_{k+1}$,
$k = 1, 2, \ldots$. Then $c_n$ have the claimed properties. So we can assume $e_n \to 0$ and
$|u(e_n)| \to \infty$. Taking a subsequence, we can assume $\sum_n \rho(0, e_n) < \infty$. So it will
be enough to show that if $\sum_n \rho(0, e_n) < \infty$ then $|u(e_n)|$ are bounded.

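Let $Q$ be the Cartesian product $\prod_{n=1}^{\infty} I_n$, where for each $n$, $I_n$ is a copy of
$[-1, 1]$. (So $Q$ is the closed unit ball of $\ell^\infty$.) For
$\lambda := \{\lambda_n\}_{n=1}^{\infty} \in Q$, the series
$h(\lambda) := \sum_{n=1}^{\infty} \lambda_n e_n$ converges in $E$, since by the properties of $\rho$,

$$\rho\Bigl(0, \sum_{j=k}^{n} \lambda_j e_j\Bigr) \le \sum_{j=k}^{n} \rho(0, \lambda_j e_j) \le \sum_{j=k}^{n} \rho(0, e_j) \to 0$$

as $n \ge k \to \infty$, and $(E, \rho)$ is complete. Also, $h$ is continuous for the product
topology $\mathcal{T}$ on $Q$. For all $n$, let $\mu_n$ be the uniform law $U[-1, 1]$ (one-half
times Lebesgue measure on $[-1, 1]$) and let $\mu := \prod_{n=1}^{\infty} \mu_n$ on $Q$. Then $\mu$
is a regular Borel measure on the compact metrizable space $(Q, \mathcal{T})$. Thus
the image measure $\mu \circ h^{-1}$ is a regular Borel measure on $(E, \rho)$. Since $u$ is
universally measurable, it is $\mu \circ h^{-1}$ measurable (that is, measurable for the
completion of $\mu \circ h^{-1}$ on the Borel $\sigma$-algebra). So $U := u \circ h$ is measurable
for the completion of $\mu$ on the Borel $\sigma$-algebra of $(Q, \mathcal{T})$. For $0 < M < \infty$ let
$Q_M := \{x \in Q : |U(x)| \le M\}$. Then $Q_M$ is $\mu$-measurable and $\mu(Q_M) \uparrow 1$
as $M \uparrow +\infty$. For a given $n$ write $Q = I_n \times Q_{(n)}$ where
$Q_{(n)} := \prod_{j \ne n} I_j$. Let $\mu_{(n)} := \prod_{j \ne n} \mu_j$. Then
$\mu = \mu_n \times \mu_{(n)}$. For $y \in Q_{(n)}$ let
$Q_{M,n}(y) := \{t \in I_n : (t, y) \in Q_M\}$. By the Tonelli–Fubini theorem,

$$\mu(Q_M) = \int \mu_n(Q_{M,n}(y)) \, d\mu_{(n)}(y).$$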


For $s, t \in Q_{M,n}(y)$, $U(s, y) - U(t, y) = (s - t) u(e_n)$. So $|U(s, y)| \le M$ and
$|U(t, y)| \le M$ imply $|s - t| \le 2M / |u(e_n)|$. (We can assume $u(e_n) \ne 0$ for
all $n$.) So $\mu_n(Q_{M,n}(y)) \le M / |u(e_n)|$ (since $\mu_n$ has density
$\tfrac{1}{2} \cdot 1_{I_n}$). So $\mu(Q_M) \le M / |u(e_n)|$, and
$|u(e_n)| \le M / \mu(Q_M) < \infty$ for all $n$ if $\mu(Q_M) > 0$,
which is true for some $M$ large enough. So $|u(e_n)|$ are bounded, proving
Theorem F.1.
For $0 < p < \infty$ let $\mathcal{L}^p([0, 1], \lambda)$ be the space of all Lebesgue measurable
real-valued functions $f$ on $[0, 1]$ with $\int_0^1 |f(x)|^p \, dx < \infty$. Let
$L^p := L^p[0, 1] := L^p([0, 1], \lambda)$ be the space of all equivalence classes of
functions in $\mathcal{L}^p([0, 1], \lambda)$ for equality a.e. ($\lambda$). For $1 \le p < \infty$,
recall that $L^p$ is a Banach space with the norm
$\|f\|_p := (\int_0^1 |f(x)|^p \, dx)^{1/p}$. For $0 < p < 1$,
$\|\cdot\|_p$ is not a norm. Instead, let $\rho_p(f, g) := \int_0^1 |f - g|^p(x) \, dx$. It will be
shown that for $0 < p < 1$, $\rho_p$ defines a metric on $L^p$. For any $a \ge 0$ and
$b \ge 0$, $(a + b)^p \le a^p + b^p$: to see this, let $r := 1/p > 1$, $u := a^p$, $v := b^p$,
and apply the Minkowski inequality in $\ell^r$. The triangle inequality for $\rho_p$
follows, and the other properties of a metric are clear. Also, $\rho_p$ has the properties
of the metric $\rho$ in the proof of Theorem F.1. To see that $L^p$ is complete for
$\rho_p$, let $\{f_n\}$ be a Cauchy sequence. As in the proof for $p \ge 1$ (RAP, Theorem
5.2.1) there is a subsequence $f_{n_k}$ converging for almost all $x$ to some $f(x)$, and
$\rho_p(f_{n_k}, f) \to 0$, so $\rho_p(f_n, f) \to 0$.
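
The contrast between $\rho_p$ and $\|\cdot\|_p$ for $0 < p < 1$ can be seen on step functions. The sketch below is only an illustration under stated assumptions: functions on $[0, 1]$ are represented by their values on equal subintervals, and the helper names `rho_p`, `quasi_norm`, and `subadditive_on_grid` are invented for the example.

```python
def rho_p(f, g, p):
    """rho_p(f, g) = integral of |f - g|^p for step functions given by their
    values on len(f) equal subintervals of [0, 1]."""
    return sum(abs(a - b) ** p for a, b in zip(f, g)) / len(f)


def quasi_norm(f, p):
    """||f||_p = (integral of |f|^p)^(1/p); not a norm when 0 < p < 1."""
    return rho_p(f, [0.0] * len(f), p) ** (1.0 / p)


def subadditive_on_grid(p, steps=50):
    """Check (a + b)^p <= a^p + b^p for a, b on a grid in [0, 5]."""
    grid = [5.0 * i / steps for i in range(steps + 1)]
    return all((a + b) ** p <= a ** p + b ** p + 1e-12 for a in grid for b in grid)


p = 0.5
f = [1.0, 0.0]                        # indicator of [0, 1/2)
g = [0.0, 1.0]                        # indicator of [1/2, 1]
h = [a + b for a, b in zip(f, g)]     # f + g, the constant function 1
zero = [0.0, 0.0]

print(quasi_norm(h, p), quasi_norm(f, p) + quasi_norm(g, p))
print(rho_p(h, zero, p), rho_p(f, zero, p) + rho_p(g, zero, p))
```

With these two indicators the triangle inequality fails for $\|\cdot\|_p$ but holds, with equality, for $\rho_p$.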


Theorem F.3 For $0 < p < 1$ there are no nonzero, universally measurable
linear forms on $L^p[0, 1]$.


Proof. Suppose $u$ is such a linear form. Then by Theorem F.1, $u$ is continuous.
Now $L^1[0, 1] \subset L^p[0, 1]$ and the injection $h$ from $L^1$ into $L^p$ is continuous.
Thus, $u \circ h$ is a continuous linear form on $L^1$. So, for some $g \in L^\infty[0, 1]$,
$u(h(f)) = \int_0^1 (fg)(x) \, dx$ for all $f \in L^1[0, 1]$ (e.g., RAP, Theorem 6.4.1). Then
we can assume that $g \ge c > 0$ on a set $E$ with $\lambda(E) > 0$ (replacing $u$ by $-u$
if necessary). Let $f_n = n$ on a set $E_n \subset E$ with $\lambda(E_n) = \lambda(E)/n$ and $f_n = 0$
elsewhere. Then $f_n \to 0$ in $L^p$ for $0 < p < 1$ and
$u(f_n) = \int_0^1 (f_n g)(x) \, dx \ge c \lambda(E) > 0$ for all $n$, a contradiction.
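
The arithmetic behind the contradiction is short enough to tabulate. The sketch below assumes, for simplicity, that $g$ equals the constant $c$ on $E$ (the proof only needs $g \ge c$ there); the numerical values of $p$, $c$, and $\lambda(E)$ and the function names are hypothetical.

```python
def rho_p_of_spike(n, p, lam_E):
    """rho_p(f_n, 0) = n^p * lambda(E_n), where lambda(E_n) = lambda(E) / n."""
    return n ** p * (lam_E / n)


def linear_form_on_spike(n, c, lam_E):
    """Integral of f_n * g when g = c on E: n * c * lambda(E_n) = c * lambda(E)."""
    return n * c * (lam_E / n)


p, c, lam_E = 0.5, 2.0, 0.3   # hypothetical values for illustration
for n in (1, 100, 10000):
    print(n, rho_p_of_spike(n, p, lam_E), linear_form_on_spike(n, c, lam_E))
```

The distance $\rho_p(f_n, 0) = n^{p-1} \lambda(E)$ tends to 0, while the value of the linear form stays at $c \lambda(E)$, so no such form can be continuous at 0.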


Note that Theorem F.1 fails if the completeness assumption is omitted: let $H$
be an infinite-dimensional Hilbert space and let $h_n$ be an infinite orthonormal
sequence in $H$. Let $E$ be the set of all finite linear combinations of the $h_n$.
Then there exists a linear form $u$ on $E$ with $u(h_n) = n$ for all $n$, and $u$ is Borel
measurable on $E$ but not continuous.


Notes on Appendix F. The proof of Theorem F.1 is adapted from a proof for
the case that $E$ is a Banach space. The proof is due to A. Douady, according
to L. Schwartz (1966) who published it. The fact for Banach spaces extends 
easily to inductive limits of Banach spaces (for definitions see, e.g., Schaefer 
1966), or so-called ultrabornologic locally convex topological linear spaces, and 
from linear forms to linear maps into general locally convex spaces. Among 
ultrabornologic spaces are the spaces of Schwartz’s theory of distributions 
(generalized functions). 
Appendix G


Separation of Analytic Sets; Borel Injections


Recall that a Polish space is a topological space $S$ metrizable by a metric for
which $S$ is complete and separable. Also, in any topological space $X$, the
$\sigma$-algebra of Borel sets is generated by the open sets. Two disjoint subsets $A$, $C$
of $X$ are said to be separated by Borel sets if there is a Borel set $B \subset X$ such
that $A \subset B$ and $C \subset X \setminus B$. Recall that a set $A$ in a Polish space $Y$ is called
analytic iff there is another Polish space $X$, a Borel subset $B$ of $X$, and a Borel
measurable function $f$ from $B$ into $Y$ such that $A = f[B] := \{f(x) : x \in B\}$
(e.g., RAP, Section 13.2). Equivalently, we can take $f$ to be continuous and/or
$B = X$ and/or $X = \mathbb{N}^\infty$, where $\mathbb{N}$ is the set of nonnegative integers with discrete
topology and $\mathbb{N}^\infty$ the Cartesian product of an infinite sequence of copies of $\mathbb{N}$,
with product topology (RAP, Theorem 13.2.1).


Theorem G.1 (Separation theorem for analytic sets) Let X be a Polish 
space. Then any disjoint analytic subsets A,C of X can be separated by 
Borel sets. 


Proof. First, it will be shown that: 


Lemma G.2 If $C_j$ for $j = 1, 2, \ldots$, and $D$ are subsets of $X$ such that for each
$j$, $C_j$ and $D$ can be separated by Borel sets, then $\bigcup_{j=1}^{\infty} C_j$ and $D$ can be
separated by Borel sets.


Proof. For each $j$, take a Borel set $B_j$ such that $D \subset B_j$ and $C_j \subset X \setminus B_j$.
Let $B := \bigcap_{j=1}^{\infty} B_j$. Then $D \subset B$ and
$\bigcup_{j=1}^{\infty} C_j \subset X \setminus B$, so the Lemma is
proved.


Next, it will be shown that: 


Lemma G.3 If $C_j$ and $D_j$ for $j = 1, 2, \ldots$, are subsets of $X$ such that for
every $i$ and $j$, $C_i$ and $D_j$ can be separated by Borel sets, then
$\bigcup_{i=1}^{\infty} C_i$ and $\bigcup_{j=1}^{\infty} D_j$ can be separated by Borel sets.

Proof. By Lemma G.2, for each $j$, $\bigcup_{i=1}^{\infty} C_i$ and $D_j$ can be separated by Borel
sets. Applying Lemma G.2 again gives Lemma G.3.


To continue the proof of Theorem G.1, we can assume $A$ and $C$ are both
nonempty. Then there exist continuous functions $f$, $g$ from $\mathbb{N}^\infty$ into $X$ such that
$f[\mathbb{N}^\infty] = A$ and $g[\mathbb{N}^\infty] = C$.

Suppose $A$ and $C$ cannot be separated by Borel sets. For $k = 1, 2, \ldots$, and
$m_i \in \mathbb{N}$, $i = 1, \ldots, k$, let

$$\mathbb{N}^\infty(m_1, \ldots, m_k) := \{n = \{n_i\}_{i=1}^{\infty} \in \mathbb{N}^\infty : n_i = m_i \text{ for } i = 1, \ldots, k\}.$$

By Lemma G.3, there exist $m_1$ and $n_1$ such that $f[\mathbb{N}^\infty(m_1)]$ and
$g[\mathbb{N}^\infty(n_1)]$ cannot be separated by Borel sets. Inductively, for all
$k = 1, 2, \ldots$, there exist $m_k$ and $n_k$ such that $f[\mathbb{N}^\infty(m_1, \ldots, m_k)]$ and
$g[\mathbb{N}^\infty(n_1, \ldots, n_k)]$ cannot be separated by Borel sets. Let
$m := \{m_i\}_{i=1}^{\infty}$, $n := \{n_i\}_{i=1}^{\infty}$. If $f(m) \ne g(n)$,
then there are disjoint open sets $U$, $V$ with $f(m) \in U$ and $g(n) \in V$. But
then by continuity of $f$ and $g$, for $k$ large enough,
$f[\mathbb{N}^\infty(m_1, \ldots, m_k)] \subset U$ and $g[\mathbb{N}^\infty(n_1, \ldots, n_k)] \subset V$, which
gives a contradiction. So $f(m) = g(n)$. But $f(m) \in A$, $g(n) \in C$, and
$A \cap C = \emptyset$, another contradiction, which proves
Theorem G.1.


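Lemma G.4 If $X$ is a Polish space and $A_j$, $j = 1, 2, \ldots$, are analytic in $X$,
then $\bigcup_{j=1}^{\infty} A_j$ is analytic.


Proof. For $j = 1, 2, \ldots$, let $f_j$ be a continuous function from $\mathbb{N}^\infty$ onto $A_j$. For
$n := \{n_i\}_{i \ge 1} \in \mathbb{N}^\infty$ define $f(n) := f_{n_1}(\{n_{i+1}\}_{i \ge 1})$. Then $f$ is
continuous from $\mathbb{N}^\infty$ onto $\bigcup_{j=1}^{\infty} A_j$.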


Corollary G.5 If $X$ is a Polish space and $A_j$, $j = 1, 2, \ldots$, are disjoint analytic
subsets of $X$, then there exist disjoint Borel sets $B_j$ such that $A_j \subset B_j$ for
all $j$.


Proof. For each $j$ such that $A_j = \emptyset$ we can take $B_j = \emptyset$. So we can assume all
the $A_j$ are nonempty. For each $k = 1, 2, \ldots$, by Lemma G.4,
$A^{(k)} := \bigcup_{j \ne k} A_j$ is analytic. Then by Theorem G.1, there is a Borel set $C_k$
with $A_k \subset C_k$ and $A^{(k)} \subset X \setminus C_k$. Let $B_1 := C_1$ and for $j \ge 2$,
$B_j := C_j \setminus \bigcup_{k < j} C_k$. Then $B_j$
have the properties stated.
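
The last step of this proof is the usual disjointification of a sequence of sets. A finite toy version, with hypothetical sets of integers standing in for the analytic sets $A_j$ and the separating Borel sets $C_k$, can be sketched as follows.

```python
def disjointify(C):
    """B_1 := C_1 and B_j := C_j minus the union of the C_k with k < j."""
    B, seen = [], set()
    for C_j in C:
        B.append(set(C_j) - seen)
        seen |= set(C_j)
    return B


# Hypothetical disjoint sets A_j, and sets C_k with A_k inside C_k and
# C_k disjoint from every A_j with j != k (the C_k may overlap each other).
A = [{1, 2}, {3}, {4, 5}]
C = [{1, 2, 10}, {3, 10, 11}, {4, 5, 11, 12}]

B = disjointify(C)
print(B)
```

Each $A_j$ survives inside $B_j$ because $A_j$ is disjoint from every $C_k$ with $k \ne j$, and the $B_j$ are pairwise disjoint by construction.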


Theorem G.6 Let $S$ be a Polish space, $Y$ a separable metric space, and $A$ a
Borel subset of $S$. Let $f$ be a 1-1, Borel measurable function from $A$ into $Y$.
Then the range $f[A]$ is a Borel subset of $Y$, and $f^{-1}$ is a Borel measurable
function from $f[A]$ onto $A$.


Proof. The following fact will be used. 
Lemma G.7 Let $S$ be a Polish space and $Y$ a separable metric space. Let $A$
be a nonempty Borel subset of $S$ and $f$ a 1-1, Borel measurable function from
$A$ into $Y$. Then there exists a Borel measurable function $g$ from $Y$ onto $A$ such
that $g(f(x)) = x$ for all $x$ in $A$.


Proof. Let $d$ metrize $S$ where $(S, d)$ is complete and separable. Let
$\{y_j\}_{j=1}^{\infty}$ be dense in $S$. Choose $x_0 \in A$. Recall that
$B(x, r) := \{y : d(x, y) < r\}$. For each $k = 1, 2, \ldots$ and $j = 1, 2, \ldots$, let
$A_{kj} := A \cap B(y_j, 1/k) \setminus \bigcup_{i < j} B(y_i, 1/k)$,
where the union is empty for $j = 1$. Then for each $k$, the $A_{kj}$ for $j = 1, 2, \ldots$
are disjoint Borel sets whose union is $A$, and for each $j$, $d(x, y) < 2/k$ for any
$x, y \in A_{kj}$. Since $f$ is 1-1, for each $k$, $f[A_{kj}]$ are disjoint analytic sets in $Y$.
Thus by Corollary G.5 there exist disjoint Borel subsets $B_{kj}$ of $Y$ with
$f[A_{kj}] \subset B_{kj}$, and with $B_{kj} = \emptyset$ if $A_{kj} = \emptyset$. For each $j$ such that
$A_{kj}$ is nonempty, choose a point $x_{kj} \in A_{kj}$. Define a function $g_k$ on $Y$ by letting
$g_k(y) := x_{kj}$ if $y \in B_{kj}$ for any $j$ and $g_k(y) := x_0$ if
$y \notin \bigcup_j B_{kj}$. Then $g_k$ is Borel measurable (e.g.
RAP, Lemma 4.2.4).

Since (S, d) is complete, the set G of y in Y such that $g_k(y)$ converges in S
is the same as the set on which $\{g_k(y)\}_{k \ge 1}$ is a Cauchy sequence, so

$$G = \bigcap_{m \ge 1} \bigcup_{n \ge 1} \bigcap_{k \ge n} \{y : d(g_k(y), g_n(y)) < 1/m\}.$$

Since S is separable, $d(\cdot, \cdot)$ is product measurable on $S \times S$ (e.g., RAP,
Propositions 4.1.7 and 2.1.4). Thus G is a Borel set in Y. On G, let
$h(y) := \lim_{k \to \infty} g_k(y)$. Then h is Borel measurable (RAP, Theorem 4.2.2).
Let $g(y) := h(y)$ if $y \in G$ and $h(y) \in A$. Otherwise, let $g(y) := x_0$. Then
g is a Borel function from Y onto A. If $x \in A$, then for each k, $x \in A_{kj}$
for some j, $f(x) \in B_{kj}$, and $g_k(f(x)) = x_{kj}$, so $d(x, g_k(f(x))) < 2/k$. Thus
$g_k(f(x)) \to x \in A$, so $g(f(x)) = h(f(x)) = x$, proving Lemma G.7.


Now to prove Theorem G.6, we can assume A is nonempty. Take the function
g from Lemma G.7. Then if $y \in f[A]$, we have $f(g(y)) = y$, while if $y \notin f[A]$,
then $f(g(y)) \ne y$. So $f[A] = \{y \in Y : f(g(y)) = y\}$. If e is a metric for Y,
then $f[A] = \{y \in Y : e(f(g(y)), y) = 0\}$. So by the product measurability
of e (as for d above), $f[A]$ is a Borel set in Y. Thus for any Borel set $B \subset A$,
$(f^{-1})^{-1}(B) = f[B]$ is Borel in Y, so $f^{-1}$ is Borel measurable, and Theorem
G.6 is proved.


Note on Appendix G. This Appendix is entirely based on Cohn (1980), Section 
8.3. 
Young–Orlicz Spaces


A convex, increasing function g from $[0, \infty)$ onto itself will be called a
Young–Orlicz modulus. Then g is continuous since it is increasing and onto. Let
$(X, \mathcal{S}, \mu)$ be a measure space and g a Young–Orlicz modulus. Let $\mathcal{L}_g(X, \mathcal{S}, \mu)$
be the set of all real-valued measurable functions f on X such that

$$\|f\|_g := \inf\Big\{c > 0 : \int g(|f(x)|/c)\,d\mu(x) \le 1\Big\} < \infty. \tag{H.1}$$


Let $L_g$ be the set of equivalence classes of functions in $\mathcal{L}_g(X, \mathcal{S}, \mu)$ for equality
almost everywhere ($\mu$). By monotone convergence, we have


Proposition H.1 For any Young–Orlicz modulus g and any $f \in \mathcal{L}_g(X, \mathcal{S}, \mu)$,
if $0 < c := \|f\|_g < +\infty$, we have $\int g(|f|/c)\,d\mu \le 1$, in other words, the
infimum in the definition of $\|f\|_g$ is attained. Also, $\|f\|_g = 0$ if and only if
$f = 0$ almost everywhere for $\mu$.
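
As an illustration (not part of the original text), take the power function $g(x) = x^p$ with $1 \le p < \infty$, which is convex, increasing, and maps $[0, \infty)$ onto itself. Under that assumption the norm in (H.1) reduces to the usual $L^p$ norm, and the infimum is visibly attained:

```latex
\[
  \int g\bigl(|f|/c\bigr)\,d\mu \;=\; c^{-p}\int |f|^p\,d\mu \;\le\; 1
  \quad\Longleftrightarrow\quad
  c \;\ge\; \Bigl(\int |f|^p\,d\mu\Bigr)^{1/p},
\]
\[
  \text{so}\qquad
  \|f\|_g \;=\; \Bigl(\int |f|^p\,d\mu\Bigr)^{1/p} \;=\; \|f\|_p ,
  \qquad
  \int g\bigl(|f|/\|f\|_p\bigr)\,d\mu \;=\; 1 \quad (0 < \|f\|_p < \infty).
\]
```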


Next, we have 


Lemma H.2 For any Young–Orlicz modulus g, and any measurable functions
$f_n$, if $|f_n| \uparrow |f|$, then $\|f_n\|_g \uparrow \|f\|_g \le +\infty$.


Proof. For any $c > 0$, $\int g(|f_n|/c)\,d\mu \uparrow \int g(|f|/c)\,d\mu$ by ordinary monotone
convergence. Thus $\|f_n\|_g \uparrow t$ for some t. Taking $c = t$ and using Proposition
H.1 we get $\|f\|_g \le t$, while clearly $\|f\|_g \ge t$.


Next is a fact showing that (not surprisingly) convergence in $\|\cdot\|_g$ norm
implies convergence in measure (or probability):


Lemma H.3 For any Young–Orlicz modulus g, and any $\varepsilon > 0$, there is a $\delta > 0$
such that if $\|f\|_g \le \delta$, then $\mu(|f| > \varepsilon) < \varepsilon$.


Proof. For any $\delta > 0$, $\|f\|_g \le \delta$ is equivalent to $\int g(|f|/\delta)\,d\mu \le 1$ by Proposition
H.1. Since g is onto $[0, \infty)$, there is some $M < \infty$ with $g(M) > 1/\varepsilon$
and $M > 1$. Thus $\|f\|_g \le \delta$ implies $\mu(|f| > M\delta) < \varepsilon$. So we can set
$\delta := \varepsilon/M$.
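
For a concrete instance, assume the hypothetical choice $g(x) = x^p$ with $1 \le p < \infty$, and $0 < \varepsilon < 1$. The constants M and $\delta$ below are one admissible choice, not the only one; the bound itself is Markov's inequality applied to $g(|f|/\delta)$.

```latex
% Choose M with g(M) > 1/\varepsilon and M > 1:
\[
  M := 2\,\varepsilon^{-1/p}, \qquad g(M) = 2^p/\varepsilon > 1/\varepsilon,
  \qquad \delta := \varepsilon/M = \tfrac12\,\varepsilon^{1+1/p}.
\]
% If \|f\|_g \le \delta, the integral of g(|f|/\delta) is at most 1, so
\[
  \mu\bigl(|f| > \varepsilon\bigr)
  \;=\; \mu\bigl(|f|/\delta > M\bigr)
  \;\le\; \frac{1}{g(M)}\int g\bigl(|f|/\delta\bigr)\,d\mu
  \;\le\; \frac{\varepsilon}{2^p} \;<\; \varepsilon .
\]
```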
Theorem H.4 For any measure space $(X, \mathcal{S}, \mu)$, $L_g(X, \mathcal{S}, \mu)$ is a Banach
space.


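Proof. For any real number $\lambda \ne 0$ it is clear that a function f is in $\mathcal{L}_g(X, \mathcal{S}, \mu)$
if and only if $\lambda f$ is, with $\|\lambda f\|_g = |\lambda|\,\|f\|_g$ by definition of $\|\cdot\|_g$. If
$f, h \in \mathcal{L}_g(X, \mathcal{S}, \mu)$ and $0 < \lambda < 1$, then since g is convex we have by Jensen’s
inequality (RAP, 10.2.6) for any $c, d > 0$

$$\int g\Big(\lambda\frac{|f|}{c} + (1-\lambda)\frac{|h|}{d}\Big)\,d\mu \le \lambda\int g\Big(\frac{|f|}{c}\Big)\,d\mu + (1-\lambda)\int g\Big(\frac{|h|}{d}\Big)\,d\mu, \tag{H.2}$$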


where some or all three integrals may be infinite. Applying this for $c = 1$ and
$h = 0$ gives

$$\int g(\lambda|f|)\,d\mu \le \lambda\int g(|f|)\,d\mu. \tag{H.3}$$


Clearly, if the inequality in (H.1) holds for some $c > 0$, it also holds for
any larger c. It follows that $\lambda f + (1-\lambda)h \in \mathcal{L}_g(X, \mathcal{S}, \mu)$. For the triangle
inequality, let $c := \|f\|_g$ and $d := \|h\|_g$. If $c = 0$, then $f = 0$ a.e. ($\mu$),
so $\|f + h\|_g = \|h\|_g \le c + d$, and likewise if $d = 0$. If $c > 0$ and $d > 0$, let
$\lambda := c/(c + d)$ in (H.2). Applying Proposition H.1 to both terms on the right
in (H.2) gives $\int g(|f + h|/(c + d))\,d\mu \le 1$, and so $\|f + h\|_g \le c + d$. So
the triangle inequality holds and $\|\cdot\|_g$ is a seminorm on $\mathcal{L}_g(X, \mathcal{S}, \mu)$. Clearly
it becomes a norm on $L_g(X, \mathcal{S}, \mu)$. To see that the latter space is complete
for $\|\cdot\|_g$, let $\{f_k\}$ be a Cauchy sequence. By Lemma H.3, take $\delta_j := \delta$ for
$\varepsilon := \varepsilon_j := 1/2^j$. Take a subsequence $f_{k(j)}$ with $\|f_i - f_{k(j)}\|_g \le \delta_j$ for any
$i \ge k(j)$ and $j = 1, 2, \ldots$. Then $f_{k(j)}$ converges $\mu$-almost everywhere, by the
proof of the Borel–Cantelli Lemma, to some f. Then $\|f_{k(j)} - f\|_g \to 0$ as
$j \to \infty$ by Fatou’s Lemma applied to functions $g(|f_{k(j)} - f|/c)$ for $c > 2^{-j}$.
It follows that $\|f_i - f\|_g \to 0$ as $i \to \infty$, completing the proof.
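
The triangle-inequality step can be written out as follows. This is a sketch of the substitution $\lambda = c/(c+d)$ in the Jensen inequality displayed earlier in the proof, with $c = \|f\|_g > 0$ and $d = \|h\|_g > 0$; it also uses that g is increasing and $|f + h| \le |f| + |h|$.

```latex
\[
  \lambda = \frac{c}{c+d}
  \quad\Longrightarrow\quad
  \frac{\lambda}{c} = \frac{1-\lambda}{d} = \frac{1}{c+d},
  \qquad
  \lambda\frac{|f|}{c} + (1-\lambda)\frac{|h|}{d} = \frac{|f|+|h|}{c+d}.
\]
% Both integrals on the right are at most 1 by Proposition H.1.
\[
  \int g\Bigl(\frac{|f+h|}{c+d}\Bigr)d\mu
  \;\le\; \int g\Bigl(\frac{|f|+|h|}{c+d}\Bigr)d\mu
  \;\le\; \lambda\int g\Bigl(\frac{|f|}{c}\Bigr)d\mu
        + (1-\lambda)\int g\Bigl(\frac{|h|}{d}\Bigr)d\mu
  \;\le\; \lambda + (1-\lambda) \;=\; 1 .
\]
```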


Let $\Phi$ be a Young–Orlicz modulus. Then it has one-sided derivatives
as follows (RAP, Corollary 6.3.3): $\phi(x) := \Phi'(x+) := \lim_{y \downarrow x} (\Phi(y) - \Phi(x))/(y - x)$
exists for all $x \ge 0$, and $\phi(x-) := \Phi'(x-) := \lim_{y \uparrow x} (\Phi(x) - \Phi(y))/(x - y)$
exists for all $x > 0$. As the notation suggests, for each $x > 0$,
$\phi(x-) = \lim_{y \uparrow x} \phi(y)$, and $\phi$ is a nondecreasing function on $[0, \infty)$. Thus,
$\phi(x-) = \phi(x)$ except for at most countably many values of x, where $\phi$ may
have jumps with $\phi(x) > \phi(x-)$. On any bounded interval, where $\phi$ is bounded,
$\Phi$ is Lipschitz and so absolutely continuous. Thus since $\Phi(0) = 0$ we have
$\Phi(x) = \int_0^x \phi(u)\,du$ for any $x > 0$ (e.g., Rudin 1974, Theorem 8.18). For any
$x > 0$, $\phi(x) > 0$ since $\Phi$ is strictly increasing.

If $\phi$ is unbounded, for $0 \le y < \infty$ let $\psi(y) := \phi^{\leftarrow}(y) := \inf\{x \ge 0 :
\phi(x) \ge y\}$. Then $\psi(0) = 0$ and $\psi$ is nondecreasing. Let $\Psi(y) := \int_0^y \psi(t)\,dt$.
Then $\Psi$ is convex and $\Psi' = \psi$ except on the at most countable set where $\psi$ has
jumps. Thus for each $y > 0$ we have $\psi(y) > 0$ and $\Psi$ is also strictly increasing.

For any nondecreasing function f from $[0, \infty)$ into itself, it is easily seen
that for any $x > 0$ and $u > 0$, $f^{\leftarrow}(u) \ge x$ if and only if $f(t) < u$ for all $t < x$.
It follows that $(f^{\leftarrow})^{\leftarrow}(x) = f(x-)$ for all $x > 0$. Since a change in $\phi$ or $\psi$ on
a countable set (of its jumps) does not change its indefinite integral $\Phi$ or $\Psi$
respectively, the relation between $\Phi$ and $\Psi$ is symmetric.
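
A small example with a jump may help. It assumes the generalized inverse is read as $f^{\leftarrow}(u) := \inf\{x \ge 0 : f(x) \ge u\}$, and the particular function f below is hypothetical, chosen only for illustration.

```latex
\[
  f(x) = \begin{cases} x, & 0 \le x < 1,\\ x+1, & x \ge 1, \end{cases}
  \qquad
  f^{\leftarrow}(u) = \begin{cases} u, & 0 \le u \le 1,\\ 1, & 1 < u \le 2,\\ u-1, & u > 2. \end{cases}
\]
% Inverting once more recovers f except at the jump, where the left limit appears:
\[
  (f^{\leftarrow})^{\leftarrow}(x) = \begin{cases} x, & 0 \le x \le 1,\\ x+1, & x > 1, \end{cases}
  \qquad\text{so}\qquad
  (f^{\leftarrow})^{\leftarrow}(1) = 1 = f(1-) \ne f(1) = 2 .
\]
```

The two functions f and $(f^{\leftarrow})^{\leftarrow}$ differ only at the single jump point, so they have the same indefinite integral.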

A Young–Orlicz modulus $\Phi$ such that $\phi$ is unbounded and $\phi(x) \downarrow 0$ as $x \downarrow 0$
will be called an Orlicz modulus. Then $\psi$ is also unbounded and $\psi(y) > 0$ for
all $y > 0$, so $\Psi$ is also an Orlicz modulus. In that case $\Phi$ and $\Psi$ will be called
dual Orlicz moduli. For such moduli we have a basic inequality due to W. H.
Young:


Theorem H.5 (W. H. Young) Let $\Phi$, $\Psi$ be any two dual Young–Orlicz moduli
from $[0, \infty)$ onto itself. Then for any $x, y \ge 0$ we have

$$xy \le \Phi(x) + \Psi(y),$$

with equality if $x > 0$ and $y = \phi(x-)$.


Proof. If $x = 0$ or $y = 0$, there is no problem. Let $x > 0$ and $y > 0$. Then
$\Phi(x)$ is the area of the region A: $0 \le u \le x$, $0 \le v < \phi(u)$ in the (u, v) plane.
Likewise, $\Psi(y)$ is the area of the region B: $0 \le v \le y$, $0 \le u < \psi(v)$. By
monotonicity and right-continuity of $\phi$, $u \ge \psi(v)$ is equivalent to $\phi(u) \ge v$,
so $u < \psi(v)$ is equivalent to $\phi(u) < v$, so $A \cap B = \emptyset$. The rectangle $R_{x,y}$:
$0 \le u \le x$, $0 \le v \le y$ is included in $A \cup B \cup C$, where C has zero area, and
if $y = \phi(x-)$, then $R_{x,y} = A \cup B$ up to a set of zero area, so the conclusions
hold.
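
The classical special case is worth recording. Assume the hypothetical choice $\Phi(x) = x^p/p$ with $1 < p < \infty$ and let $q := p/(p-1)$; then $\phi$ is continuous, unbounded, and tends to 0 at 0, so the construction of the dual modulus applies and gives the familiar form of Young's inequality.

```latex
\[
  \phi(x) = x^{p-1}, \qquad \psi(y) = y^{1/(p-1)} = y^{q-1}, \qquad
  \Psi(y) = \int_0^y t^{q-1}\,dt = \frac{y^q}{q},
\]
\[
  xy \;\le\; \frac{x^p}{p} + \frac{y^q}{q} \qquad (x, y \ge 0).
\]
% Equality at y = \phi(x) = x^{p-1}, using (p-1)q = p and 1/p + 1/q = 1:
\[
  \frac{x^p}{p} + \frac{x^{(p-1)q}}{q} = x^p\Bigl(\frac1p + \frac1q\Bigr) = x^p = x\cdot x^{p-1}.
\]
```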


One of the main uses of inequality H.5 is to prove an extension of the
Rogers–Hölder inequality to Young–Orlicz spaces:


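Theorem H.6 Let $\Phi$ and $\Psi$ be dual Orlicz moduli, and for a measure space
$(X, \mathcal{S}, \mu)$ let $f \in \mathcal{L}_\Phi(X, \mathcal{S}, \mu)$ and $g \in \mathcal{L}_\Psi(X, \mathcal{S}, \mu)$. Then $fg \in \mathcal{L}^1(X, \mathcal{S}, \mu)$
and $\int |fg|\,d\mu \le 2\|f\|_\Phi \|g\|_\Psi$.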

Proof. By homogeneity we can assume $\|f\|_\Phi = \|g\|_\Psi = 1$. Then applying
Proposition H.1 with $c = 1$ and Theorem H.5, we get $\int |fg|\,d\mu \le 2$, and
the conclusion follows.
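
In more detail, the two steps of the proof can be sketched as follows, first under the normalization $\|f\|_\Phi = \|g\|_\Psi = 1$ (the case where either norm is 0 is trivial, since then $fg = 0$ almost everywhere):

```latex
% Theorem H.5 pointwise, then Proposition H.1 with c = 1 for each integral:
\[
  \int |fg|\,d\mu \;\le\; \int \Phi(|f|)\,d\mu + \int \Psi(|g|)\,d\mu \;\le\; 1 + 1 \;=\; 2 .
\]
% General case: apply this to f/\|f\|_\Phi and g/\|g\|_\Psi and multiply through.
\[
  \int |fg|\,d\mu
  \;=\; \|f\|_\Phi\,\|g\|_\Psi \int \frac{|f|}{\|f\|_\Phi}\,\frac{|g|}{\|g\|_\Psi}\,d\mu
  \;\le\; 2\,\|f\|_\Phi\,\|g\|_\Psi .
\]
```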


Notes to Appendix H. W. H. Young (1912) proved his inequality (Theorem
H.5) for smooth functions $\Phi$. Birnbaum and Orlicz (1931) apparently began
the theory of “Orlicz spaces,” and W. A. J. Luxemburg defined the norms $\|\cdot\|_\Phi$;
see Luxemburg and Zaanen (1956). Krasnosel’skii and Rutitskii (1961) wrote
a book on the topic.
Appendix I


Modifications and Versions of 
Isonormal Processes 


Let T be any set and $(\Omega, \mathcal{A}, P)$ a probability space. Recall that a real-valued
stochastic process indexed by T is a function $(t, \omega) \mapsto X_t(\omega)$ from $T \times \Omega$ into
$\mathbb{R}$ such that for each $t \in T$, $X_t(\cdot)$ is measurable from $\Omega$ into $\mathbb{R}$. A modification
of the process is another stochastic process $Y_t$ defined for the same T and $\Omega$
such that for each t, we have $P(X_t = Y_t) = 1$. A version of the process $X_t$
is a process $Z_t$, $t \in T$, for the same T but defined on a possibly different
probability space $(\Omega_1, \mathcal{B}, Q)$ such that $X_t$ and $Z_t$ have the same laws, i.e., for
each finite subset F of T, $\mathcal{L}(\{X_t\}_{t \in F}) = \mathcal{L}(\{Z_t\}_{t \in F})$. Clearly, any modification
of a process is also a version of the process, but a version, even if on the same
probability space, may not be a modification. For example, for an isonormal
process L on a Hilbert space H, the process $M(x) := L(-x)$ is a version, but
not a modification, of L.
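
The example can be verified in two lines. The sketch below uses only the standard facts that an isonormal process L is a mean-zero Gaussian process with $E\,L(x)L(y) = (x, y)$, the inner product in H, and that $L(-x) = -L(x)$ almost surely.

```latex
% Same covariance, hence the same finite-dimensional (Gaussian) laws: M is a version of L.
\[
  E\,M(x)M(y) \;=\; E\,L(-x)L(-y) \;=\; (-x, -y) \;=\; (x, y) \;=\; E\,L(x)L(y).
\]
% Not a modification: for x \ne 0, L(x) is N(0, \|x\|^2), which has no atom at 0.
\[
  P\bigl(M(x) = L(x)\bigr) \;=\; P\bigl(-L(x) = L(x)\bigr) \;=\; P\bigl(L(x) = 0\bigr) \;=\; 0 .
\]
```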

One may take a version or modification of a process in order to get better 
properties such as continuity. It turns out that for the isonormal process on 
subsets of Hilbert space, what can be done with a version can also be done by 
a modification, as follows. 


Theorem I.1 Let L be an isonormal process restricted to a subset C of Hilbert
space. For each of the following properties, if there exists a version M of L
with the property, there also is a modification N with the property. For each $\omega$,
$x \mapsto M(x)(\omega)$ for $x \in C$ is:

(a) bounded (b) uniformly continuous.

Also, if there is a version with (a) and another with (b), then there is a
modification $N(\cdot)$ having both properties.


Proof. Let A be a countable dense subset of C. For each $x \in C$, take $x_n \in A$
with $\|x_n - x\| \le 1/n^2$ for all n. Then $L(x_n) \to L(x)$ a.s. Thus if we define
$N(x)(\omega) := \limsup_{n \to \infty} L(x_n)(\omega)$, or 0 on the set of probability 0 where the
lim sup is infinite, then N is a modification of L. If (a) holds for M, it will
also hold for L on A and so for N, and likewise for (b), since a uniformly
continuous function $L(\cdot)(\omega)$ on A has a unique uniformly continuous extension
to C given by $N(\cdot)(\omega)$. Since $N(\cdot)$ is the same in both cases, the last conclusion
follows.
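
The almost sure convergence of $L(x_n)$ to $L(x)$ used at the start of the proof follows from a routine Borel–Cantelli argument. A sketch, using $E\,(L(u) - L(v))^2 = \|u - v\|^2$ for the isonormal process and the choice $\|x_n - x\| \le 1/n^2$:

```latex
% Chebyshev's inequality with E (L(x_n) - L(x))^2 = \|x_n - x\|^2 \le n^{-4}:
\[
  P\bigl(|L(x_n) - L(x)| > 1/n\bigr)
  \;\le\; n^2\,\|x_n - x\|^2 \;\le\; n^{-2},
  \qquad \sum_{n \ge 1} n^{-2} < \infty .
\]
% By the Borel-Cantelli lemma, almost surely |L(x_n) - L(x)| \le 1/n for all
% large n, so L(x_n) \to L(x) almost surely.
```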
weighted intervals, 266
non-Donsker classes, 172, 179, 266, 
391-415 
P-universal, 320 
stability of 
convex hull, 171 
sums, 171 
union of two, 162-163 
sufficient conditions for 
bracketing, 274-283 
differentiability classes, 289, 300 
Koltchinskii–Pollard entropy, 252
sequences of functions, 163–168, 172
Vapnik–Červonenkis classes, 258, 260
uniform, 322, 372 
universal, 322, 360-366, 372, 374, 389 
Donsker’s theorem, 6, 58, 172 
dual Banach space, 92, 93, 272 


edge (of a graph), 187 

Effros Borel structure, 233, 234, 238 

Egorov’s theorem, 144, 432 

ellipsoids, 128, 129, 171, 181, 364, 374, 
389 


empirical measure, 1, 133, 137, 144, 179, 242, 
426 
as statistic, 219, 235, 264, 319, 323 
bootstrap, 323, 324 
empirical process, 133-137, 159-162 
in one dimension, 3, 6, 56, 58 
envelope function, 205, 239, 265, 269, 362 
ε-net, 8, 267
equicontinuity, asymptotic, 159, 162, 174, 336 
equivalent measures, 216 
essential infimum or supremum, 66, 138, 141, 
173 
estimator, 262, 263, 264 
exchangeable, 235 
exponential family, 235, 425 


factorization theorem, 215, 217, 237 
filtration, 398 


Gaussian processes, 61 
see also pregaussian, 134 
GB-sets, 66, 94 
and metric entropy, 96, 129 
implied by GC-set, 88 
necessary conditions, 88, 96 
special classes 
ellipsoids, 128 
GC-sets, 66, 94 
and Gaussian processes, 124 
and metric entropy, 129 
criteria for, 90 
other metrics, 124 
implies GB-set, 88 
special classes 
random Fourier series, 128, 129 
sequences, 128, 129 
stability of 
convex hull, 90 
union of two, 94 
sufficient conditions 
metric entropy, 94, 96, 132 
unit ball of dual space, 92 
generic chaining, 108-117 
geometric mean, 11
Glivenko—Cantelli class, 144, 269-274 
envelope, integrability, 265 
examples, 281 
convex sets, 306 
lower layers, 317 
non-examples, 170, 179, 214, 281 
strong, 144 
sufficient conditions for 
bracketing (Blum—DeHardt), 270 
differentiability classes, 289, 300 
unit ball (Mourier), 273
Vapnik–Červonenkis classes, 175
uniform, 264, 348—360 
universal, 266, 267 
weak, 144 
not strong, 267 
Glivenko—Cantelli theorem, 2, 57 
graph, 187 


Hausdorff metric, 233, 284 
Hilbert space, 64 

Hölder condition, 287, 410
homotopy, 296, 297 


image admissible, 222, 223, 225, 227, 238 
Suslin, 234, 362 
image admissible Suslin, 230, 232, 236, 237, 
238, 262, 268 
incomparable, 186 
independent 
as sets, 202, 264 
events, sequences of, 164, 168, 173 
pieces, process with, 398 
random elements, 142 
independent events, 170, 172 
inequalities, 9-15 
Bernstein’s, 9, 10, 11, 58, 165, 172 
Chernoff, 13 
Hoeffding’s, 10, 11, 58 
Hoffmann-Jørgensen, 329
Jensen’s, 50, 328 
conditional, 72 
Komatsu’s, 99, 132 
Lévy, 14, 59, 143 
Ottaviani’s, 14, 59 
Slepian’s, 74, 76, 78 
with stars, 328, 329, 330, 332, 340 
Young, W. H., 447 
inner product, 63 
integral 
Bochner, 93 
Pettis, 93 
intensity measure, 171, 396 
isonormal process, 64, 65, 93, 94, 121, 122, 
124, 125, 134, 448-449 


Kolmogorov–Smirnov statistics, 323
Koltchinskii—Pollard entropy, 239, 267 
Kuiper statistic, 323 


law, 62 

convergence in, 136 
law, probability measure, 3, 63, 332 
likelihood ratio, 218 
linearly ordered by inclusion, 182, 184, 185,
197, 205, 282 
Lipschitz functions, seminorm, 154, 253, 286, 
344 


loss function, 262 

lower layers, 300-305, 318, 361, 401-410, 
415 

lower semilattice, 138 

Lusin’s theorem, 148, 432, 433 


major set or class, 204 
Marczewski function, 224 
marginals, laws with given, 32, 160, 161 
Markov property, 7, 415 
for random sets, 400 
measurability, 58, 136, 213-238, 262, 267, 
429-433, 439-444 
measurable 
cover functions, 173 
cover of a set, 137, 335 
universally, 60, 148, 229, 230, 242, 247, 
439, 441 
vector space, 67, 68, 69, 70 
measurable cardinals, 430 
metric 
dual-bounded-Lipschitz, 157, 324, 
372 
Prokhorov, 56, 157, 160 
Skorokhod’s, 58 
metric entropy, 7—9 
of convex hulls, 366-372 
with bracketing, 269, 393 
metrization of convergence in law, 154, 155, 
324, 374 
Mills’ ratio, 99 
minimax risk, 262, 263 
minoration, Sudakov, 76 
mixed volume, 306, 307 
modification, 65, 179, 267, 268, 448-449 
monotone regression, 318 
multinomial distribution, 13, 396, 426-428 


Neyman-Pearson lemma, 98, 219 
node (of a graph), 187 
nonatomic measure, 235, 236, 430 
norm, 63 

bounded Lipschitz, 154, 288, 324 

supremum, 6, 15, 58, 70, 223, 234 
normal distribution or law, 345 

in finite dimensions, 64, 69, 79 

in infinite dimensions, 64, 68 

in one dimension, 67, 68, 131 
normed space, 63 
null hypothesis, 319 


orthants, as VC class, 198 
outer probability, 143, 144, 324 


P-Donsker class, see Donsker class 
P-perfect, 146 
Pascal’s triangle, 176 
history, 211 
perfect function, 146, 147, 148, 149, 150, 154, 
320 
perfect probability space, 148 
Pettis integral, 93, 283, 434-438 
Poisson 
distributions, 13, 14, 57, 171, 331, 333, 396, 
406 
process, 171, 396, 398, 401, 414 
Poissonization, 331, 333, 391, 395—400, 415 
polar, of a set, 87, 127, 128 
Polish space, 32, 60, 161 
Pollard’s entropy condition, 252, 374 
polyhedra, polytopes, 309, 318, 410 
polynomials, 180 
polytope, 182 
portmanteau theorem, 154, 174 
pregaussian, 134, 135, 137, 159, 172, 260, 361 
uniformly, 373, 374 
prelinear, 88, 89, 135, 179 
pseudo-metric, 122 
pseudo-seminorm, 70 


quasi-order, 185 
quasiperfect, 148 


Rademacher variables, 10, 326 
random element, 136, 143, 149, 154, 157, 169, 
324 

RAP, xii 
realization 

a. s. convergent, 149-153 

of a stochastic process, 89, 90 
reflexive Banach space, 93 
regular conditional probabilities, 161 
reversed (sub)martingale, 243, 245, 249 
risk, 262, 263 


sample, 319, 325 
-bounded, 94 
-continuous, 7, 94, 97, 122, 129, 134 
function, 66, 123, 124 
modulus, 94, 95 
space, 133, 338 
Sauer’s Lemma, 176, 185, 189, 190, 194, 206, 
209, 250 
selection theorem, 230, 231, 238 
seminorm, 63 


separable measurable space, 222 
shatter, 175 
σ-algebra

Borel, 63, 68, 71, 121 

product, 133, 140, 223, 226 

tail, 72, 91 
slowly varying function, 395 
standard deviation, 76 
standard normal law, 67, 83, 87, 95 
stationary process, 248 
statistic, 215, 218, 219, 345, 347 
Stirling’s formula, 13 
stochastic process, 62, 64, 94, 121, 122, 124,
132, 262, 398, 448

stopping set, 391, 398, 400, 414 
Strassen’s theorem, 56, 174 
subadditive process, 247, 248 
subgraph, 204, 285, 400 
submartingale, 72, 242, 243 
Sudakov—Chevet theorem, 76, 83, 88 
sufficient 

σ-algebra, 215-222

collection of functions, 220 

statistic, 215, 218, 219, 235 
superadditive process, 247, 248 
supremum norm, 71 
Suslin property, 229, 230, 234, 262, 268 
symmetric random element, 142 
symmetrization, 241, 249, 250, 267, 327, 330 


tail event or σ-algebra, 72, 91
tail probability, 92 
tetrahedra, 266 
tight, 148 
uniformly, 148 
tightness, 60 
topological vector space, 68, 70, 439 
tree, 187 
triangle function, 128 
triangular arrays, 325, 333 
triangulation, 309, 318 
truncation, 10 
two-sample process, 172, 319-323 


u.m., see universally measurable, 148 

uniform distribution on [0, 1], 161 

uniformly integrable, 420, 421, 424 

universal class α function, 226

universally measurable, 60, 148, 229, 230, 
242, 247, 439 

upper integral, 136, 137 


Vapnik–Červonenkis classes
of functions
hull, 204, 207, 208, 361, 372
major, 204, 208, 209, 361 
subgraph, 204, 258, 361, 362, 372 
of sets, 176-204, 240, 258, 260, 263, 264, 
360 
VC, see Vapnik–Červonenkis class
VC dimension, 176 
vector space 
measurable, 67, 68, 69, 70 
topological, 68, 70, 437 
version, 64, 121, 179, 448-449 




version-continuous, 121, 122, 124, 125, 
128 
volume, 129, 299, 306, 365, 393 


weak* topology, 93 
Wiener process, 3 


Young—Orlicz norms, modulus, 97, 130, 
445—447 


zero-one law for Gaussian measures, 69 